Background and Overview

DataCamp offers several interactive courses related to R Programming. While much of it is review, it is always helpful to see other perspectives on material. As well, DataCamp has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:

An additional document will be maintained for several of the more statistical areas of the Data Camp offering, as well as for the few courses offered in Python.

Key Insights and Findings

Introduction to R and Intermediate R

There are a few nuggest from within these beginning modules, including:

Generic statements

  • factor(x, ordered=TRUE, levels=c(myLevels)) creates ordinal factors (e.g., a > b > c)
  • subset(a, b) is functionally the same as a[a$b, ] but easier to read
  • & looks at each element while && looks only at the first element (same for | and ||)
  • Inside of a for loop, break kills the loop entirely while next moves back to the top for the next item
  • args(function) shows the arguments (with defaults) for function
  • search() shows the current search path (all auto-load packages and all attached packages)
  • cat(“expression”) will print the expression or direct it to a file; this is a way to allow and to take effect in a print statement
  • unique() keeps only the non-duplicated elements of a vector
  • unlist() converts a list back to a vector, somewhat similar to as.vector() on a matrix
  • sort() will sort a vector, but not a data frame
  • rep(a, times=m, each=n) replicates each element of a n times, and then the whole string m times
  • append(x, values, after=length(x)) will insert values in to vector x after point after
  • rev() reverses a vector
  • Inside a grep, “\1” captures what is inside the ()

Apply usages

  • lapply() operates on a vector/list and always returns a list
  • sapply() is lapply but converted to a vector/array when possible (same as lapply if not possible); if USE.NAMES=FALSE then the vector will be unnamed, though the default is USE.NAMES=TRUE for a named vector
  • vapply(X, FUN, FUN.VALUE, … , USE.NAMES=TRUE) is safer than sapply in that you specify what type of vector each iteration should produce; e.g., FUN.VALUE=character(1) or FUN.VALUE=numeric(3), with an error if the vector produced by an iteration is not exactly that

Dates and times

  • Sys.Date() grabs the system date as class “Date”, with units of days
  • Sys.time() grabs the system time as class “POSIXct”, with units of seconds
  • Sys.timezone() shows the system timezone
  • Years are formatted as %Y (4-digit) or %y (2-digit)
  • Months are formatted as %m (2-digit) or %B (full character) or %b (3-character)
  • Days are formatted as %d (2-digit)
  • Weekdays are formatted as %A (full name) or %a (partial name)
  • Times include %H (24-hour hour), %M (minutes), %S (seconds)
  • ?strptime will provide a lot more detail on the formats

Below is some sample code showing examples for the generic statements:

# Factors
xRaw = c("High", "High", "Low", "Low", "Medium", "Very High", "Low")

xFactorNon = factor(xRaw, levels=c("Low", "Medium", "High", "Very High"))
xFactorNon
## [1] High      High      Low       Low       Medium    Very High Low      
## Levels: Low Medium High Very High
xFactorNon[xFactorNon == "High"] > xFactorNon[xFactorNon == "Low"][1]
## Warning in Ops.factor(xFactorNon[xFactorNon == "High"],
## xFactorNon[xFactorNon == : '>' not meaningful for factors
## [1] NA NA
xFactorOrder = factor(xRaw, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
xFactorOrder
## [1] High      High      Low       Low       Medium    Very High Low      
## Levels: Low < Medium < High < Very High
xFactorOrder[xFactorOrder == "High"] > xFactorOrder[xFactorOrder == "Low"][1]
## [1] TRUE TRUE
# Subsets
data(mtcars)
subset(mtcars, mpg>=25)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
identical(subset(mtcars, mpg>=25), mtcars[mtcars$mpg>=25, ])
## [1] TRUE
subset(mtcars, mpg>25, select=c("mpg", "cyl", "disp"))
##                 mpg cyl  disp
## Fiat 128       32.4   4  78.7
## Honda Civic    30.4   4  75.7
## Toyota Corolla 33.9   4  71.1
## Fiat X1-9      27.3   4  79.0
## Porsche 914-2  26.0   4 120.3
## Lotus Europa   30.4   4  95.1
# & and && (same as | and ||)
compA <- c(2, 3, 4, 1, 2, 3)
compB <- c(1, 2, 3, 4, 5, 6)
(compA > compB) & (compA + compB < 6)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
(compA > compB) | (compA + compB < 6)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
(compA > compB) && (compA + compB < 6)
## [1] TRUE
(compA > compB) || (compA + compB < 6)
## [1] TRUE
# Loops and cat()
# for (a in b) {
#     do stuff
#     if (exitCond) { break }
#     if (nextCond) { next }
#     do some more stuff
# }
for (myVal in compA*compB) {
    print(paste0("myVal is: ", myVal))
    if ((myVal %% 3) == 0) { cat("Divisible by 3, not happy about that\n\n"); next }
    print("That is not divisible by 3")
    if ((myVal %% 5) == 0) { cat("Exiting due to divisible by 5 but not divisible by 3\n\n"); break }
    cat("Onwards and upwards\n\n")
}
## [1] "myVal is: 2"
## [1] "That is not divisible by 3"
## Onwards and upwards
## 
## [1] "myVal is: 6"
## Divisible by 3, not happy about that
## 
## [1] "myVal is: 12"
## Divisible by 3, not happy about that
## 
## [1] "myVal is: 4"
## [1] "That is not divisible by 3"
## Onwards and upwards
## 
## [1] "myVal is: 10"
## [1] "That is not divisible by 3"
## Exiting due to divisible by 5 but not divisible by 3
# args() and search()
args(plot.default)
## function (x, y = NULL, type = "p", xlim = NULL, ylim = NULL, 
##     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, 
##     ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL, 
##     panel.last = NULL, asp = NA, ...) 
## NULL
search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
# unique()
compA
## [1] 2 3 4 1 2 3
unique(compA)
## [1] 2 3 4 1
# unlist()
listA <- as.list(compA)
unlist(listA)
## [1] 2 3 4 1 2 3
identical(compA, unlist(listA))
## [1] TRUE
# sort()
sort(mtcars$mpg)
##  [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1
## [15] 18.7 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3
## [29] 30.4 30.4 32.4 33.9
sort(mtcars$mpg, decreasing=TRUE)
##  [1] 33.9 32.4 30.4 30.4 27.3 26.0 24.4 22.8 22.8 21.5 21.4 21.4 21.0 21.0
## [15] 19.7 19.2 19.2 18.7 18.1 17.8 17.3 16.4 15.8 15.5 15.2 15.2 15.0 14.7
## [29] 14.3 13.3 10.4 10.4
# rep()
rep(1:6, times=2)  # 1:6 followed by 1:6
##  [1] 1 2 3 4 5 6 1 2 3 4 5 6
rep(1:6, each=2)  # 1 1 2 2 3 3 4 4 5 5 6 6
##  [1] 1 1 2 2 3 3 4 4 5 5 6 6
rep(1:6, times=2, each=3)  # 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 repeated twice (each comes first)
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6
## [36] 6
rep(1:6, times=6:1)  # 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 5 6
##  [1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 5 6
# append()
myWords <- c("The", "cat", "in", "the", "hat")
paste(append(myWords, c("is", "fun", "to", "read")), collapse=" ")
## [1] "The cat in the hat is fun to read"
paste(append(myWords, "funny", 4), collapse=" ")
## [1] "The cat in the funny hat"
# grep("//1")
sampMsg <- "This is from myname@subdomain.mydomain.com again"
gsub("(^.*\\w*[a-zA-Z0-9]+@)([a-zA-Z0-9]+\\.[a-zA-Z0-9.]+)(.*$)", "\\1", sampMsg)
## [1] "This is from myname@"
gsub("(^.*\\w*[a-zA-Z0-9]+@)([a-zA-Z0-9]+\\.[a-zA-Z0-9.]+)(.*$)", "\\2", sampMsg)
## [1] "subdomain.mydomain.com"
gsub("(^.*\\w*[a-zA-Z0-9]+@)([a-zA-Z0-9]+\\.[a-zA-Z0-9.]+)(.*$)", "\\3", sampMsg)
## [1] " again"
# rev()
compA
## [1] 2 3 4 1 2 3
rev(compA)
## [1] 3 2 1 4 3 2

Below is some sample code showing examples for the apply statements:

# lapply
args(lapply)
## function (X, FUN, ...) 
## NULL
lapply(1:5, FUN=sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
lapply(1:5, FUN=function(x, y=2) { c(x=x, y=y, pow=x^y) }, y=3)
## [[1]]
##   x   y pow 
##   1   3   1 
## 
## [[2]]
##   x   y pow 
##   2   3   8 
## 
## [[3]]
##   x   y pow 
##   3   3  27 
## 
## [[4]]
##   x   y pow 
##   4   3  64 
## 
## [[5]]
##   x   y pow 
##   5   3 125
lapply(1:5, FUN=function(x, y=2) { if (x <= 3) {c(x=x, y=y, pow=x^y) } else { c(pow=x^y) } }, y=3)
## [[1]]
##   x   y pow 
##   1   3   1 
## 
## [[2]]
##   x   y pow 
##   2   3   8 
## 
## [[3]]
##   x   y pow 
##   3   3  27 
## 
## [[4]]
## pow 
##  64 
## 
## [[5]]
## pow 
## 125
# sapply (defaults to returning a named vector/array if possible; is lapply otherwise)
args(sapply)
## function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) 
## NULL
args(simplify2array)
## function (x, higher = TRUE) 
## NULL
sapply(1:5, FUN=sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
sapply(1:5, FUN=function(x, y=2) { c(x=x, y=y, pow=x^y) }, y=3)
##     [,1] [,2] [,3] [,4] [,5]
## x      1    2    3    4    5
## y      3    3    3    3    3
## pow    1    8   27   64  125
sapply(1:5, FUN=function(x, y=2) { if (x <= 3) {c(x=x, y=y, pow=x^y) } else { c(pow=x^y) } }, y=3)
## [[1]]
##   x   y pow 
##   1   3   1 
## 
## [[2]]
##   x   y pow 
##   2   3   8 
## 
## [[3]]
##   x   y pow 
##   3   3  27 
## 
## [[4]]
## pow 
##  64 
## 
## [[5]]
## pow 
## 125
# vapply (tells sapply exactly what should be returned; errors out otherwise)
args(vapply)
## function (X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) 
## NULL
vapply(1:5, FUN=sqrt, FUN.VALUE=numeric(1))
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
vapply(1:5, FUN=function(x, y=2) { c(x=x, y=y, pow=x^y) }, FUN.VALUE=numeric(3), y=3)
##     [,1] [,2] [,3] [,4] [,5]
## x      1    2    3    4    5
## y      3    3    3    3    3
## pow    1    8   27   64  125

Below is some sample code for handing dates and times in R:

Sys.Date()
## [1] "2017-01-24"
Sys.time()
## [1] "2017-01-24 09:21:42 CST"
args(strptime)
## function (x, format, tz = "") 
## NULL
rightNow <- as.POSIXct(Sys.time())
format(rightNow, "%Y**%M-%d %H hours and %M minutes", usetz=TRUE)
## [1] "2017**21-24 09 hours and 21 minutes CST"
lastChristmasNoon <- as.POSIXct("2015-12-25 12:00:00", format="%Y-%m-%d %X")
rightNow - lastChristmasNoon
## Time difference of 395.8901 days
nextUMHomeGame <- as.POSIXct("16/SEP/3 12:00:00", format="%y/%b/%d %H:%M:%S", tz="America/Detroit")
nextUMHomeGame - rightNow
## Time difference of -142.9734 days
# Time zones available in R
OlsonNames()
##   [1] "Africa/Abidjan"                   "Africa/Accra"                    
##   [3] "Africa/Addis_Ababa"               "Africa/Algiers"                  
##   [5] "Africa/Asmara"                    "Africa/Asmera"                   
##   [7] "Africa/Bamako"                    "Africa/Bangui"                   
##   [9] "Africa/Banjul"                    "Africa/Bissau"                   
##  [11] "Africa/Blantyre"                  "Africa/Brazzaville"              
##  [13] "Africa/Bujumbura"                 "Africa/Cairo"                    
##  [15] "Africa/Casablanca"                "Africa/Ceuta"                    
##  [17] "Africa/Conakry"                   "Africa/Dakar"                    
##  [19] "Africa/Dar_es_Salaam"             "Africa/Djibouti"                 
##  [21] "Africa/Douala"                    "Africa/El_Aaiun"                 
##  [23] "Africa/Freetown"                  "Africa/Gaborone"                 
##  [25] "Africa/Harare"                    "Africa/Johannesburg"             
##  [27] "Africa/Juba"                      "Africa/Kampala"                  
##  [29] "Africa/Khartoum"                  "Africa/Kigali"                   
##  [31] "Africa/Kinshasa"                  "Africa/Lagos"                    
##  [33] "Africa/Libreville"                "Africa/Lome"                     
##  [35] "Africa/Luanda"                    "Africa/Lubumbashi"               
##  [37] "Africa/Lusaka"                    "Africa/Malabo"                   
##  [39] "Africa/Maputo"                    "Africa/Maseru"                   
##  [41] "Africa/Mbabane"                   "Africa/Mogadishu"                
##  [43] "Africa/Monrovia"                  "Africa/Nairobi"                  
##  [45] "Africa/Ndjamena"                  "Africa/Niamey"                   
##  [47] "Africa/Nouakchott"                "Africa/Ouagadougou"              
##  [49] "Africa/Porto-Novo"                "Africa/Sao_Tome"                 
##  [51] "Africa/Timbuktu"                  "Africa/Tripoli"                  
##  [53] "Africa/Tunis"                     "Africa/Windhoek"                 
##  [55] "America/Adak"                     "America/Anchorage"               
##  [57] "America/Anguilla"                 "America/Antigua"                 
##  [59] "America/Araguaina"                "America/Argentina/Buenos_Aires"  
##  [61] "America/Argentina/Catamarca"      "America/Argentina/ComodRivadavia"
##  [63] "America/Argentina/Cordoba"        "America/Argentina/Jujuy"         
##  [65] "America/Argentina/La_Rioja"       "America/Argentina/Mendoza"       
##  [67] "America/Argentina/Rio_Gallegos"   "America/Argentina/Salta"         
##  [69] "America/Argentina/San_Juan"       "America/Argentina/San_Luis"      
##  [71] "America/Argentina/Tucuman"        "America/Argentina/Ushuaia"       
##  [73] "America/Aruba"                    "America/Asuncion"                
##  [75] "America/Atikokan"                 "America/Atka"                    
##  [77] "America/Bahia"                    "America/Bahia_Banderas"          
##  [79] "America/Barbados"                 "America/Belem"                   
##  [81] "America/Belize"                   "America/Blanc-Sablon"            
##  [83] "America/Boa_Vista"                "America/Bogota"                  
##  [85] "America/Boise"                    "America/Buenos_Aires"            
##  [87] "America/Cambridge_Bay"            "America/Campo_Grande"            
##  [89] "America/Cancun"                   "America/Caracas"                 
##  [91] "America/Catamarca"                "America/Cayenne"                 
##  [93] "America/Cayman"                   "America/Chicago"                 
##  [95] "America/Chihuahua"                "America/Coral_Harbour"           
##  [97] "America/Cordoba"                  "America/Costa_Rica"              
##  [99] "America/Creston"                  "America/Cuiaba"                  
## [101] "America/Curacao"                  "America/Danmarkshavn"            
## [103] "America/Dawson"                   "America/Dawson_Creek"            
## [105] "America/Denver"                   "America/Detroit"                 
## [107] "America/Dominica"                 "America/Edmonton"                
## [109] "America/Eirunepe"                 "America/El_Salvador"             
## [111] "America/Ensenada"                 "America/Fort_Nelson"             
## [113] "America/Fort_Wayne"               "America/Fortaleza"               
## [115] "America/Glace_Bay"                "America/Godthab"                 
## [117] "America/Goose_Bay"                "America/Grand_Turk"              
## [119] "America/Grenada"                  "America/Guadeloupe"              
## [121] "America/Guatemala"                "America/Guayaquil"               
## [123] "America/Guyana"                   "America/Halifax"                 
## [125] "America/Havana"                   "America/Hermosillo"              
## [127] "America/Indiana/Indianapolis"     "America/Indiana/Knox"            
## [129] "America/Indiana/Marengo"          "America/Indiana/Petersburg"      
## [131] "America/Indiana/Tell_City"        "America/Indiana/Vevay"           
## [133] "America/Indiana/Vincennes"        "America/Indiana/Winamac"         
## [135] "America/Indianapolis"             "America/Inuvik"                  
## [137] "America/Iqaluit"                  "America/Jamaica"                 
## [139] "America/Jujuy"                    "America/Juneau"                  
## [141] "America/Kentucky/Louisville"      "America/Kentucky/Monticello"     
## [143] "America/Knox_IN"                  "America/Kralendijk"              
## [145] "America/La_Paz"                   "America/Lima"                    
## [147] "America/Los_Angeles"              "America/Louisville"              
## [149] "America/Lower_Princes"            "America/Maceio"                  
## [151] "America/Managua"                  "America/Manaus"                  
## [153] "America/Marigot"                  "America/Martinique"              
## [155] "America/Matamoros"                "America/Mazatlan"                
## [157] "America/Mendoza"                  "America/Menominee"               
## [159] "America/Merida"                   "America/Metlakatla"              
## [161] "America/Mexico_City"              "America/Miquelon"                
## [163] "America/Moncton"                  "America/Monterrey"               
## [165] "America/Montevideo"               "America/Montreal"                
## [167] "America/Montserrat"               "America/Nassau"                  
## [169] "America/New_York"                 "America/Nipigon"                 
## [171] "America/Nome"                     "America/Noronha"                 
## [173] "America/North_Dakota/Beulah"      "America/North_Dakota/Center"     
## [175] "America/North_Dakota/New_Salem"   "America/Ojinaga"                 
## [177] "America/Panama"                   "America/Pangnirtung"             
## [179] "America/Paramaribo"               "America/Phoenix"                 
## [181] "America/Port-au-Prince"           "America/Port_of_Spain"           
## [183] "America/Porto_Acre"               "America/Porto_Velho"             
## [185] "America/Puerto_Rico"              "America/Rainy_River"             
## [187] "America/Rankin_Inlet"             "America/Recife"                  
## [189] "America/Regina"                   "America/Resolute"                
## [191] "America/Rio_Branco"               "America/Rosario"                 
## [193] "America/Santa_Isabel"             "America/Santarem"                
## [195] "America/Santiago"                 "America/Santo_Domingo"           
## [197] "America/Sao_Paulo"                "America/Scoresbysund"            
## [199] "America/Shiprock"                 "America/Sitka"                   
## [201] "America/St_Barthelemy"            "America/St_Johns"                
## [203] "America/St_Kitts"                 "America/St_Lucia"                
## [205] "America/St_Thomas"                "America/St_Vincent"              
## [207] "America/Swift_Current"            "America/Tegucigalpa"             
## [209] "America/Thule"                    "America/Thunder_Bay"             
## [211] "America/Tijuana"                  "America/Toronto"                 
## [213] "America/Tortola"                  "America/Vancouver"               
## [215] "America/Virgin"                   "America/Whitehorse"              
## [217] "America/Winnipeg"                 "America/Yakutat"                 
## [219] "America/Yellowknife"              "Antarctica/Casey"                
## [221] "Antarctica/Davis"                 "Antarctica/DumontDUrville"       
## [223] "Antarctica/Macquarie"             "Antarctica/Mawson"               
## [225] "Antarctica/McMurdo"               "Antarctica/Palmer"               
## [227] "Antarctica/Rothera"               "Antarctica/South_Pole"           
## [229] "Antarctica/Syowa"                 "Antarctica/Troll"                
## [231] "Antarctica/Vostok"                "Arctic/Longyearbyen"             
## [233] "Asia/Aden"                        "Asia/Almaty"                     
## [235] "Asia/Amman"                       "Asia/Anadyr"                     
## [237] "Asia/Aqtau"                       "Asia/Aqtobe"                     
## [239] "Asia/Ashgabat"                    "Asia/Ashkhabad"                  
## [241] "Asia/Baghdad"                     "Asia/Bahrain"                    
## [243] "Asia/Baku"                        "Asia/Bangkok"                    
## [245] "Asia/Beirut"                      "Asia/Bishkek"                    
## [247] "Asia/Brunei"                      "Asia/Calcutta"                   
## [249] "Asia/Chita"                       "Asia/Choibalsan"                 
## [251] "Asia/Chongqing"                   "Asia/Chungking"                  
## [253] "Asia/Colombo"                     "Asia/Dacca"                      
## [255] "Asia/Damascus"                    "Asia/Dhaka"                      
## [257] "Asia/Dili"                        "Asia/Dubai"                      
## [259] "Asia/Dushanbe"                    "Asia/Gaza"                       
## [261] "Asia/Harbin"                      "Asia/Hebron"                     
## [263] "Asia/Ho_Chi_Minh"                 "Asia/Hong_Kong"                  
## [265] "Asia/Hovd"                        "Asia/Irkutsk"                    
## [267] "Asia/Istanbul"                    "Asia/Jakarta"                    
## [269] "Asia/Jayapura"                    "Asia/Jerusalem"                  
## [271] "Asia/Kabul"                       "Asia/Kamchatka"                  
## [273] "Asia/Karachi"                     "Asia/Kashgar"                    
## [275] "Asia/Kathmandu"                   "Asia/Katmandu"                   
## [277] "Asia/Khandyga"                    "Asia/Kolkata"                    
## [279] "Asia/Krasnoyarsk"                 "Asia/Kuala_Lumpur"               
## [281] "Asia/Kuching"                     "Asia/Kuwait"                     
## [283] "Asia/Macao"                       "Asia/Macau"                      
## [285] "Asia/Magadan"                     "Asia/Makassar"                   
## [287] "Asia/Manila"                      "Asia/Muscat"                     
## [289] "Asia/Nicosia"                     "Asia/Novokuznetsk"               
## [291] "Asia/Novosibirsk"                 "Asia/Omsk"                       
## [293] "Asia/Oral"                        "Asia/Phnom_Penh"                 
## [295] "Asia/Pontianak"                   "Asia/Pyongyang"                  
## [297] "Asia/Qatar"                       "Asia/Qyzylorda"                  
## [299] "Asia/Rangoon"                     "Asia/Riyadh"                     
## [301] "Asia/Saigon"                      "Asia/Sakhalin"                   
## [303] "Asia/Samarkand"                   "Asia/Seoul"                      
## [305] "Asia/Shanghai"                    "Asia/Singapore"                  
## [307] "Asia/Srednekolymsk"               "Asia/Taipei"                     
## [309] "Asia/Tashkent"                    "Asia/Tbilisi"                    
## [311] "Asia/Tehran"                      "Asia/Tel_Aviv"                   
## [313] "Asia/Thimbu"                      "Asia/Thimphu"                    
## [315] "Asia/Tokyo"                       "Asia/Ujung_Pandang"              
## [317] "Asia/Ulaanbaatar"                 "Asia/Ulan_Bator"                 
## [319] "Asia/Urumqi"                      "Asia/Ust-Nera"                   
## [321] "Asia/Vientiane"                   "Asia/Vladivostok"                
## [323] "Asia/Yakutsk"                     "Asia/Yekaterinburg"              
## [325] "Asia/Yerevan"                     "Atlantic/Azores"                 
## [327] "Atlantic/Bermuda"                 "Atlantic/Canary"                 
## [329] "Atlantic/Cape_Verde"              "Atlantic/Faeroe"                 
## [331] "Atlantic/Faroe"                   "Atlantic/Jan_Mayen"              
## [333] "Atlantic/Madeira"                 "Atlantic/Reykjavik"              
## [335] "Atlantic/South_Georgia"           "Atlantic/St_Helena"              
## [337] "Atlantic/Stanley"                 "Australia/ACT"                   
## [339] "Australia/Adelaide"               "Australia/Brisbane"              
## [341] "Australia/Broken_Hill"            "Australia/Canberra"              
## [343] "Australia/Currie"                 "Australia/Darwin"                
## [345] "Australia/Eucla"                  "Australia/Hobart"                
## [347] "Australia/LHI"                    "Australia/Lindeman"              
## [349] "Australia/Lord_Howe"              "Australia/Melbourne"             
## [351] "Australia/North"                  "Australia/NSW"                   
## [353] "Australia/Perth"                  "Australia/Queensland"            
## [355] "Australia/South"                  "Australia/Sydney"                
## [357] "Australia/Tasmania"               "Australia/Victoria"              
## [359] "Australia/West"                   "Australia/Yancowinna"            
## [361] "Brazil/Acre"                      "Brazil/DeNoronha"                
## [363] "Brazil/East"                      "Brazil/West"                     
## [365] "Canada/Atlantic"                  "Canada/Central"                  
## [367] "Canada/East-Saskatchewan"         "Canada/Eastern"                  
## [369] "Canada/Mountain"                  "Canada/Newfoundland"             
## [371] "Canada/Pacific"                   "Canada/Saskatchewan"             
## [373] "Canada/Yukon"                     "CET"                             
## [375] "Chile/Continental"                "Chile/EasterIsland"              
## [377] "CST6CDT"                          "Cuba"                            
## [379] "EET"                              "Egypt"                           
## [381] "Eire"                             "EST"                             
## [383] "EST5EDT"                          "Etc/GMT"                         
## [385] "Etc/GMT-0"                        "Etc/GMT-1"                       
## [387] "Etc/GMT-10"                       "Etc/GMT-11"                      
## [389] "Etc/GMT-12"                       "Etc/GMT-13"                      
## [391] "Etc/GMT-14"                       "Etc/GMT-2"                       
## [393] "Etc/GMT-3"                        "Etc/GMT-4"                       
## [395] "Etc/GMT-5"                        "Etc/GMT-6"                       
## [397] "Etc/GMT-7"                        "Etc/GMT-8"                       
## [399] "Etc/GMT-9"                        "Etc/GMT+0"                       
## [401] "Etc/GMT+1"                        "Etc/GMT+10"                      
## [403] "Etc/GMT+11"                       "Etc/GMT+12"                      
## [405] "Etc/GMT+2"                        "Etc/GMT+3"                       
## [407] "Etc/GMT+4"                        "Etc/GMT+5"                       
## [409] "Etc/GMT+6"                        "Etc/GMT+7"                       
## [411] "Etc/GMT+8"                        "Etc/GMT+9"                       
## [413] "Etc/GMT0"                         "Etc/Greenwich"                   
## [415] "Etc/UCT"                          "Etc/Universal"                   
## [417] "Etc/UTC"                          "Etc/Zulu"                        
## [419] "Europe/Amsterdam"                 "Europe/Andorra"                  
## [421] "Europe/Athens"                    "Europe/Belfast"                  
## [423] "Europe/Belgrade"                  "Europe/Berlin"                   
## [425] "Europe/Bratislava"                "Europe/Brussels"                 
## [427] "Europe/Bucharest"                 "Europe/Budapest"                 
## [429] "Europe/Busingen"                  "Europe/Chisinau"                 
## [431] "Europe/Copenhagen"                "Europe/Dublin"                   
## [433] "Europe/Gibraltar"                 "Europe/Guernsey"                 
## [435] "Europe/Helsinki"                  "Europe/Isle_of_Man"              
## [437] "Europe/Istanbul"                  "Europe/Jersey"                   
## [439] "Europe/Kaliningrad"               "Europe/Kiev"                     
## [441] "Europe/Lisbon"                    "Europe/Ljubljana"                
## [443] "Europe/London"                    "Europe/Luxembourg"               
## [445] "Europe/Madrid"                    "Europe/Malta"                    
## [447] "Europe/Mariehamn"                 "Europe/Minsk"                    
## [449] "Europe/Monaco"                    "Europe/Moscow"                   
## [451] "Europe/Nicosia"                   "Europe/Oslo"                     
## [453] "Europe/Paris"                     "Europe/Podgorica"                
## [455] "Europe/Prague"                    "Europe/Riga"                     
## [457] "Europe/Rome"                      "Europe/Samara"                   
## [459] "Europe/San_Marino"                "Europe/Sarajevo"                 
## [461] "Europe/Simferopol"                "Europe/Skopje"                   
## [463] "Europe/Sofia"                     "Europe/Stockholm"                
## [465] "Europe/Tallinn"                   "Europe/Tirane"                   
## [467] "Europe/Tiraspol"                  "Europe/Uzhgorod"                 
## [469] "Europe/Vaduz"                     "Europe/Vatican"                  
## [471] "Europe/Vienna"                    "Europe/Vilnius"                  
## [473] "Europe/Volgograd"                 "Europe/Warsaw"                   
## [475] "Europe/Zagreb"                    "Europe/Zaporozhye"               
## [477] "Europe/Zurich"                    "GB"                              
## [479] "GB-Eire"                          "GMT"                             
## [481] "GMT-0"                            "GMT+0"                           
## [483] "GMT0"                             "Greenwich"                       
## [485] "Hongkong"                         "HST"                             
## [487] "Iceland"                          "Indian/Antananarivo"             
## [489] "Indian/Chagos"                    "Indian/Christmas"                
## [491] "Indian/Cocos"                     "Indian/Comoro"                   
## [493] "Indian/Kerguelen"                 "Indian/Mahe"                     
## [495] "Indian/Maldives"                  "Indian/Mauritius"                
## [497] "Indian/Mayotte"                   "Indian/Reunion"                  
## [499] "Iran"                             "Israel"                          
## [501] "Jamaica"                          "Japan"                           
## [503] "Kwajalein"                        "Libya"                           
## [505] "MET"                              "Mexico/BajaNorte"                
## [507] "Mexico/BajaSur"                   "Mexico/General"                  
## [509] "MST"                              "MST7MDT"                         
## [511] "Navajo"                           "NZ"                              
## [513] "NZ-CHAT"                          "Pacific/Apia"                    
## [515] "Pacific/Auckland"                 "Pacific/Bougainville"            
## [517] "Pacific/Chatham"                  "Pacific/Chuuk"                   
## [519] "Pacific/Easter"                   "Pacific/Efate"                   
## [521] "Pacific/Enderbury"                "Pacific/Fakaofo"                 
## [523] "Pacific/Fiji"                     "Pacific/Funafuti"                
## [525] "Pacific/Galapagos"                "Pacific/Gambier"                 
## [527] "Pacific/Guadalcanal"              "Pacific/Guam"                    
## [529] "Pacific/Honolulu"                 "Pacific/Johnston"                
## [531] "Pacific/Kiritimati"               "Pacific/Kosrae"                  
## [533] "Pacific/Kwajalein"                "Pacific/Majuro"                  
## [535] "Pacific/Marquesas"                "Pacific/Midway"                  
## [537] "Pacific/Nauru"                    "Pacific/Niue"                    
## [539] "Pacific/Norfolk"                  "Pacific/Noumea"                  
## [541] "Pacific/Pago_Pago"                "Pacific/Palau"                   
## [543] "Pacific/Pitcairn"                 "Pacific/Pohnpei"                 
## [545] "Pacific/Ponape"                   "Pacific/Port_Moresby"            
## [547] "Pacific/Rarotonga"                "Pacific/Saipan"                  
## [549] "Pacific/Samoa"                    "Pacific/Tahiti"                  
## [551] "Pacific/Tarawa"                   "Pacific/Tongatapu"               
## [553] "Pacific/Truk"                     "Pacific/Wake"                    
## [555] "Pacific/Wallis"                   "Pacific/Yap"                     
## [557] "Poland"                           "Portugal"                        
## [559] "PRC"                              "PST8PDT"                         
## [561] "ROC"                              "ROK"                             
## [563] "Singapore"                        "Turkey"                          
## [565] "UCT"                              "Universal"                       
## [567] "US/Alaska"                        "US/Aleutian"                     
## [569] "US/Arizona"                       "US/Central"                      
## [571] "US/East-Indiana"                  "US/Eastern"                      
## [573] "US/Hawaii"                        "US/Indiana-Starke"               
## [575] "US/Michigan"                      "US/Mountain"                     
## [577] "US/Pacific"                       "US/Pacific-New"                  
## [579] "US/Samoa"                         "UTC"                             
## [581] "VERSION"                          "W-SU"                            
## [583] "WET"                              "Zulu"
# From ?strptime (excerpted)
#
# ** General formats **
# %c Date and time. Locale-specific on output, "%a %b %e %H:%M:%S %Y" on input.
# %F Equivalent to %Y-%m-%d (the ISO 8601 date format).
# %T Equivalent to %H:%M:%S.
# %D Date format such as %m/%d/%y: the C99 standard says it should be that exact format
# %x Date. Locale-specific on output, "%y/%m/%d" on input.
# %X Time. Locale-specific on output, "%H:%M:%S" on input.
# 
# ** Key Components **
# %y Year without century (00-99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19
# %Y Year with century
# %m Month as decimal number (01-12).
# %b Abbreviated month name in the current locale on this platform.
# %B Full month name in the current locale.
# %d Day of the month as decimal number (01-31).
# %e Day of the month as decimal number (1-31), with a leading space for a single-digit number.
# %a Abbreviated weekday name in the current locale on this platform.
# %A Full weekday name in the current locale.
# %H Hours as decimal number (00-23)
# %I Hours as decimal number (01-12)
# %M Minute as decimal number (00-59).
# %S Second as integer (00-61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
# 
# ** Additional Options **
# %C Century (00-99): the integer part of the year divided by 100.
# 
# %g The last two digits of the week-based year (see %V). (Accepted but ignored on input.)
# %G The week-based year (see %V) as a decimal number. (Accepted but ignored on input.)
# 
# %h Equivalent to %b.
# 
# %j Day of year as decimal number (001-366).
# 
# %n Newline on output, arbitrary whitespace on input.
# 
# %p AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales (and the behaviour is undefined if used for input in such a locale).  Some platforms accept %P for output, which uses a lower-case version: others will output P.
# 
# %r The 12-hour clock time (using the locale's AM or PM). Only defined in some locales.
# 
# %R Equivalent to %H:%M.
# 
# %t Tab on output, arbitrary whitespace on input.
# 
# %u Weekday as a decimal number (1-7, Monday is 1).
# 
# %U Week of the year as decimal number (00-53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
# 
# %V Week of the year as decimal number (01-53) as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)
# 
# %w Weekday as decimal number (0-6, Sunday is 0).
# 
# %W Week of the year as decimal number (00-53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
# 
# For input, only years 0:9999 are accepted.
# 
# %z Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC. Values up to +1400 are accepted as from R 3.1.1: previous versions only accepted up to +1200. (Standard only for output.)
# 
# %Z (Output only.) Time zone abbreviation as a character string (empty if not available). This may not be reliable when a time zone has changed abbreviations over the years.

Additionally, code from several practice examples is added:

set.seed(1608221310)

me <- 89
other_199 <- round(rnorm(199, mean=75.45, sd=11.03), 0)

mean(other_199)
## [1] 75.17588
sd(other_199)
## [1] 11.37711
desMeans <- c(72.275, 76.24, 74.5, 77.695)
desSD <- c(12.31, 11.22, 12.5, 12.53)

prevData <- c(rnorm(200, mean=72.275, sd=12.31), 
              rnorm(200, mean=76.24, sd=11.22), 
              rnorm(200, mean=74.5, sd=12.5),
              rnorm(200, mean=77.695, sd=12.53) 
              )
previous_4 <- matrix(data=prevData, ncol=4)

curMeans <- apply(previous_4, 2, FUN=mean)
curSD <- apply(previous_4, 2, FUN=sd)

previous_4 <- t(apply(previous_4, 1, FUN=function(x) { desMeans + (desSD / curSD) * (x - curMeans) } ))

apply(round(previous_4, 0), 2, FUN=mean)
## [1] 72.285 76.245 74.505 77.665
apply(round(previous_4, 0), 2, FUN=sd)
## [1] 12.35097 11.19202 12.49643 12.51744
previous_4 <- round(previous_4, 0)


# Merge me and other_199: my_class
my_class <- c(me, other_199)

# cbind() my_class and previous_4: last_5
last_5 <- cbind(my_class, previous_4)

# Name last_5 appropriately
nms <- paste0("year_", 1:5)
colnames(last_5) <- nms


# Build histogram of my_class
hist(my_class)

# Generate summary of last_5
summary(last_5)
##      year_1           year_2           year_3           year_4      
##  Min.   : 46.00   Min.   : 43.00   Min.   : 38.00   Min.   : 42.00  
##  1st Qu.: 68.00   1st Qu.: 63.75   1st Qu.: 69.00   1st Qu.: 65.75  
##  Median : 75.50   Median : 73.00   Median : 76.50   Median : 74.00  
##  Mean   : 75.25   Mean   : 72.28   Mean   : 76.25   Mean   : 74.50  
##  3rd Qu.: 83.25   3rd Qu.: 81.00   3rd Qu.: 84.25   3rd Qu.: 82.25  
##  Max.   :108.00   Max.   :108.00   Max.   :102.00   Max.   :113.00  
##      year_5      
##  Min.   : 38.00  
##  1st Qu.: 71.00  
##  Median : 78.00  
##  Mean   : 77.67  
##  3rd Qu.: 86.00  
##  Max.   :117.00
# Build boxplot of last_5
boxplot(last_5)

# How many grades in your class are higher than 75?
sum(my_class > 75)
## [1] 100
# How many students in your class scored strictly higher than you?
sum(my_class > me)
## [1] 17
# What's the proportion of grades below or equal to 64 in the last 5 years?
mean(last_5 <= 64)
## [1] 0.191
# Is your grade greater than 87 and smaller than or equal to 89?
me > 87 & me <= 89
## [1] TRUE
# Which grades in your class are below 60 or above 90?
my_class < 60 | my_class > 90
##   [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [23]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [34]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [45]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [56] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
##  [78] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [89]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [100] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [133] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [155] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [166] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [199] FALSE FALSE
# What's the proportion of grades in your class that is average?
mean(my_class >= 70 & my_class <= 85)
## [1] 0.525
# How many students in the last 5 years had a grade of 80 or 90?
sum(last_5 %in% c(80, 90))
## [1] 44
# Define n_smart
n_smart <- sum(my_class >= 80)

# Code the if-else construct
if (n_smart > 50) {
    print("smart class")
} else {
    print("rather average")
}
## [1] "smart class"
# Define prop_less
prop_less <- mean(my_class < me)

# Code the control construct
if (prop_less > 0.9) {
    print("you're among the best 10 percent")
} else if (prop_less > 0.8) {
    print("you're among the best 20 percent")
} else {
    print("need more analysis")
}
## [1] "you're among the best 20 percent"
# Embedded control structure: fix the error
if (mean(my_class) < 75) {
  if (mean(my_class) > me) {
    print("average year, but still smarter than me")
  } else {
    print("average year, but I'm not that bad")
  }
} else {
  if (mean(my_class) > me) {
    print("smart year, even smarter than me")
  } else {
    print("smart year, but I am smarter")
  }
}
## [1] "smart year, but I am smarter"
# Create top_grades
top_grades <- my_class[my_class >= 85]

# Create worst_grades
worst_grades <- my_class[my_class < 65]

# Write conditional statement
if (length(top_grades) > length(worst_grades)) { print("top grades prevail") }
## [1] "top grades prevail"

Writing Functions in R

Hadley and Charlotte Wickham led a course on writing functions in R. Broadly, the course includes advice on when/how to use functions, as well as specific advice about commands available through library(purrr).

Key pieces of advice include:

  • Write a function once you have cut and paste some code twice or more
  • Solve a simple problem before writing the function
  • A good function is both correct and understandable
  • Abstract away the for loops when possible (focus on data/actions, solve iteration more easily, have more understandable code), for example using purrr::map() or purr::map_() where type can be dbl, chr, lgl, int, forcing a type-certain output
  • Use purrr::safely() and purrr::possibly() for better error handling
  • Use purr::pmap or purr::walk2 to iterate over 2+ arguments
  • Iterate functions for their side effects (printing, plotting, etc.) using purrr::walk()
  • Use stop() and stopifnot() for error catching of function arguments/output formats
  • Avoid type-inconsistent functions (e.g., sapply)
  • Avoid non-standard functions
  • Never rely on global options (e.g., how the user will have set stringsAsFactors)

John Chambers gave a few useful slogans about functions:

  • Everything that exists is an object
  • Everything that happens is a function call

Each function has three components:

  • formals(x) are in essence the arguments as in args(), but as a list
  • body(x) is the function code
  • environment(x) is where it was defined

Only the LAST evaluated expression is returned. The use of return() is recommended only for early-returns in a special case (for example, when a break() will be called).

Further, functions can be written anonymously on the command line, such as (function (x) {x + 1}) (1:5). A function should only depend on arguments passed to it, not variables from a parent enviornment. Every time the function is called, it receives a clean working environment. Once it finishes, its variables are no longer available unless they were returned (either by default as the last operation, or by way of return()):

# Components of a function
args(rnorm)
## function (n, mean = 0, sd = 1) 
## NULL
formals(rnorm)
## $n
## 
## 
## $mean
## [1] 0
## 
## $sd
## [1] 1
body(rnorm)
## .Call(C_rnorm, n, mean, sd)
environment(rnorm)
## <environment: namespace:stats>
# What is passed back
funDummy <- function(x) {
    if (x <= 2) {
        print("That is too small")
        return(3)  # This ends the function by convention
    }
    ceiling(x)  # This is the defaulted return() value if nothing happened to prevent the code getting here
}

funDummy(1)
## [1] "That is too small"
## [1] 3
funDummy(5)
## [1] 5
# Anonymous functions
(function (x) {x + 1}) (1:5)
## [1] 2 3 4 5 6

The course includes some insightful discussion of vectors. As it happens, lists and data frames are just special collections of vectors in R. Each column of a data frame is a vector, while each element of a list is either 1) an embedded data frame (which is eventually a vector by way of columns), 2) an embedded list (which is eventually a vector by way of recursion), or 3) an actual vector.

The atomic vectors are of types logical, integer, character, and double; complex and raw are rarer types that are also available. Lists are just recursive vectors, which is to say that lists can contain other lists and can be hetergeneous. To explore vectors, you have:

  • typeof() for the type
  • length() for the length

Note that NULL is the absence of a vector and has length 0. NA is the absence of an element in the vector and has length 1. All math operations with NA return NA; for example NA == NA will return NA.

There are some good tips on extracting element from a list:

  • [] is to extract a sub-list
  • [[]] and $ more common and extract elements while removing an element of hierachy
  • seq_along(mtcars) will return 1:11 since there are 11 elements. Helfpully, is applied to a frame with no columns, this returns integer(0) which means the for() loop does not crash
  • mtcars[[11]] will return the 11th element (11th column) of mtcars
  • vector(“type”, “length”) will create a n empty vector of the requested type and length
  • range(x, na.rm=FALSE) gives vector c(xmin, xmax) which can be handy for plotting, scaling, and the like
# Data types
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
typeof(mtcars)  # n.b. that this is technically a "list"
## [1] "list"
length(mtcars)
## [1] 11
# NULL and NA
length(NULL)
## [1] 0
typeof(NULL)
## [1] "NULL"
length(NA)
## [1] 1
typeof(NA)
## [1] "logical"
NULL == NULL
## logical(0)
NULL == NA
## logical(0)
NA == NA
## [1] NA
is.null(NULL)
## [1] TRUE
is.null(NA)
## [1] FALSE
is.na(NULL)
## Warning in is.na(NULL): is.na() applied to non-(list or vector) of type
## 'NULL'
## logical(0)
is.na(NA)
## [1] TRUE
# Extraction
mtcars[["mpg"]][1:5]
## [1] 21.0 21.0 22.8 21.4 18.7
mtcars[[2]][1:5]
## [1] 6 6 4 6 8
mtcars$hp[1:5]
## [1] 110 110  93 110 175
# Relevant lengths
seq_along(mtcars)
##  [1]  1  2  3  4  5  6  7  8  9 10 11
x <- data.frame()
seq_along(x)
## integer(0)
length(seq_along(x))
## [1] 0
foo <- function(x) { for (eachCol in seq_along(x)) { print(typeof(x[[eachCol]])) }}
foo(mtcars)
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
## [1] "double"
foo(x)  # Note that this does nothing!
data(airquality)
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
foo(airquality)
## [1] "integer"
## [1] "integer"
## [1] "double"
## [1] "integer"
## [1] "integer"
## [1] "integer"
# Range command
mpgRange <- range(mtcars$mpg)
mpgRange
## [1] 10.4 33.9
mpgScale <- (mtcars$mpg - mpgRange[1]) / (mpgRange[2] - mpgRange[1])
summary(mpgScale)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2138  0.3745  0.4124  0.5277  1.0000

The typical arguments in a function use a consistent, simple naming function:

  • x, y, z: vectors
  • df: data frame
  • i, j: numeric indices (generally rows and columns)
  • n: length of number of rows
  • p: number of columns

Data arguments should come before detail arguments, and detail arguments should be given reasonable default values. See for example rnorm(n, mean=0, sd=1). The number requested (n) must be specified, but defaults are available for the details (mean and standard deviation).

Functional Programming and library(purrr)

Functions can be passed as arguments to other functions, which is at the core of functional programming. For example:

do_math <- function(x, fun) { fun(x) }
do_math(1:10, fun=mean)
## [1] 5.5
do_math(1:10, fun=sd)
## [1] 3.02765

The library(purrr) takes advantage of this, and in a type-consistent manner. There are functions for:

  • map() will create a list as the output
  • map_chr() will create a character vector as the output
  • map_dbl() will create a double vector as the output
  • map_int() will create an integer vector as the output
  • map_lgl() will create a logical (boolean) vector as the output

The general arguments are .x (a list or an atomic vector) and .f which can be either a function, an anonymous function (formula with ~), or an extractor .x[[.f]]. For example:

library(purrr)
## Warning: package 'purrr' was built under R version 3.2.5
library(RColorBrewer)  # Need to have in non-cached chunk for later

data(mtcars)

# Create output as a list
map(.x=mtcars, .f=sum)
## $mpg
## [1] 642.9
## 
## $cyl
## [1] 198
## 
## $disp
## [1] 7383.1
## 
## $hp
## [1] 4694
## 
## $drat
## [1] 115.09
## 
## $wt
## [1] 102.952
## 
## $qsec
## [1] 571.16
## 
## $vs
## [1] 14
## 
## $am
## [1] 13
## 
## $gear
## [1] 118
## 
## $carb
## [1] 90
# Create same output as a double
map_dbl(.x=mtcars, .f=sum)
##      mpg      cyl     disp       hp     drat       wt     qsec       vs 
##  642.900  198.000 7383.100 4694.000  115.090  102.952  571.160   14.000 
##       am     gear     carb 
##   13.000  118.000   90.000
# Create same output as integer
# map_int(.x=mtcars, .f=sum) . . . this would bomb since it is not actually an integere
map_int(.x=mtcars, .f=function(x) { as.integer(round(sum(x), 0)) } )
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##  643  198 7383 4694  115  103  571   14   13  118   90
# Same thing but using an anonymous function with ~ and .
map_int(.x=mtcars, .f = ~ as.integer(round(sum(.), 0)) )
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##  643  198 7383 4694  115  103  571   14   13  118   90
# Create a boolean vector
map_lgl(.x=mtcars, .f = ~ ifelse(sum(.) > 200, TRUE, FALSE) )
##   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
##  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
# Create a character vector
map_chr(.x=mtcars, .f = ~ ifelse(sum(.) > 200, "Large", "Not So Large") )
##            mpg            cyl           disp             hp           drat 
##        "Large" "Not So Large"        "Large"        "Large" "Not So Large" 
##             wt           qsec             vs             am           gear 
## "Not So Large"        "Large" "Not So Large" "Not So Large" "Not So Large" 
##           carb 
## "Not So Large"
# Use the extractor [pulls the first row]
map_dbl(.x=mtcars, .f=1)
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear 
##  21.00   6.00 160.00 110.00   3.90   2.62  16.46   0.00   1.00   4.00 
##   carb 
##   4.00
# Example from help file using chaining
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")
##         4         6         8 
## 0.5086326 0.4645102 0.4229655
# Using sapply
sapply(split(mtcars, mtcars$cyl), FUN=function(.x) { summary(lm(mpg ~ wt, data=.x))$r.squared } )
##         4         6         8 
## 0.5086326 0.4645102 0.4229655
# Use the extractor from a list
cylSplit <- split(mtcars, mtcars$cyl)
map(cylSplit, "mpg")
## $`4`
##  [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
## 
## $`6`
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
## 
## $`8`
##  [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0
map(cylSplit, "cyl")
## $`4`
##  [1] 4 4 4 4 4 4 4 4 4 4 4
## 
## $`6`
## [1] 6 6 6 6 6 6 6
## 
## $`8`
##  [1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8

The purrr library has several additional interesting functions:

  • safely() is a wrapper for any functions that traps the errors and returns a relevant list
  • possibly() is similar to safely() with the exception that a default value for error cases is supplied
  • quietly() is a wrapper to suppress verbosity
  • transpose() reverses the order of lists (making the inner-most lists the outer-most lists), which is an easy way to extract either all the answers or all the error cases
  • map2(.x, .y, .f) allows two inputs to be passed to map()
  • pmap(.l, .f) allows passing a named list with as many inputs as needed to function .f
  • invoke_map(.f, .x, …) lets you iterate over a list of functions .f
  • walk() is like map() but called solely to get function side effects (plot, save, etc.); it also returns the object that is passed to it, which can be convenient for chaining (piping)

Some example code includes:

library(purrr)  # Called again for clarity; all these key functions belong to purrr

# safely(.f, otherwise = NULL, quiet = TRUE)
safe_log10 <- safely(log10)
map(list(0, 1, 10, "a"), .f=safe_log10)
## [[1]]
## [[1]]$result
## [1] -Inf
## 
## [[1]]$error
## NULL
## 
## 
## [[2]]
## [[2]]$result
## [1] 0
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 1
## 
## [[3]]$error
## NULL
## 
## 
## [[4]]
## [[4]]$result
## NULL
## 
## [[4]]$error
## <simpleError in .f(...): non-numeric argument to mathematical function>
# possibly(.f, otherwise, quiet = TRUE)
poss_log10 <- possibly(log10, otherwise=NaN)
map_dbl(list(0, 1, 10, "a"), .f=poss_log10)
## [1] -Inf    0    1  NaN
# transpose() - note that this can become masked by data.table::transpose() so be careful
purrr::transpose(map(list(0, 1, 10, "a"), .f=safe_log10))
## $result
## $result[[1]]
## [1] -Inf
## 
## $result[[2]]
## [1] 0
## 
## $result[[3]]
## [1] 1
## 
## $result[[4]]
## NULL
## 
## 
## $error
## $error[[1]]
## NULL
## 
## $error[[2]]
## NULL
## 
## $error[[3]]
## NULL
## 
## $error[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
purrr::transpose(map(list(0, 1, 10, "a"), .f=safe_log10))$result
## [[1]]
## [1] -Inf
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 1
## 
## [[4]]
## NULL
unlist(purrr::transpose(map(list(0, 1, 10, "a"), .f=safe_log10))$result)
## [1] -Inf    0    1
purrr::transpose(map(list(0, 1, 10, "a"), .f=safe_log10))$error
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
map_lgl(purrr::transpose(map(list(0, 1, 10, "a"), .f=safe_log10))$error, is.null)
## [1]  TRUE  TRUE  TRUE FALSE
# map2(.x, .y, .f)
map2(list(5, 10, 20), list(1, 2, 3), .f=rnorm) # rnorm(5, 1), rnorm(10, 2), and rnorm(20, 3)
## [[1]]
## [1] 0.41176421 2.00652288 0.06152025 0.46963873 1.15436157
## 
## [[2]]
##  [1] 0.006821057 2.902712636 1.436150816 1.377836302 2.625075832
##  [6] 0.680797806 0.313499192 0.718062969 2.820989906 3.134207742
## 
## [[3]]
##  [1] 3.3716474 2.9393673 1.8648940 3.2343343 2.1849894 2.0697179 1.0872014
##  [8] 3.4970403 3.5769694 3.0999340 1.2033341 0.9839011 2.9820314 1.7116383
## [15] 0.8779558 1.6990118 2.5914013 2.3587803 3.7460957 1.2980312
# pmap(.l, .f)
pmap(list(n=list(5, 10, 20), mean=list(1, 5, 10), sd=list(0.1, 0.5, 0.1)), rnorm)
## [[1]]
## [1] 1.0151570 1.1573287 1.0628581 0.8805484 0.9418430
## 
## [[2]]
##  [1] 5.032920 4.689799 5.423525 5.265610 4.727383 5.252325 5.166292
##  [8] 4.861745 5.135408 4.106679
## 
## [[3]]
##  [1]  9.854138 10.090939 10.045554  9.970755 10.092487  9.769531 10.140064
##  [8]  9.834716 10.196817 10.047367 10.054093 10.006439 10.142002 10.092259
## [15] 10.222459 10.082440 10.067818  9.993884 10.078380  9.936942
# invoke_map(.f, .x, ...)
invoke_map(list(rnorm, runif, rexp), n=5)
## [[1]]
## [1] -0.96707137  0.08207476  1.39498168  0.60287972 -0.15130461
## 
## [[2]]
## [1] 0.01087442 0.02980483 0.81443586 0.88438198 0.67976034
## 
## [[3]]
## [1] 0.2646751 1.3233260 1.1079261 1.3504952 0.6795524
# walk() is for the side effects of a function
x <- list(1, "\n\ta\n", 3)
x %>% walk(cat)
## 1
##  a
## 3
# Chaining is available by way of the %>% operator
pretty_titles <- c("N(0, 1)", "Uniform(0, 1)", "Exponential (rate=1)")
set.seed(1607120947)
x <- invoke_map(list(rnorm, runif, rexp), n=5000)
foo <- function(x) { map(x, .f=summary) }
par(mfrow=c(1, 3))
pwalk(list(x=x, main=pretty_titles), .f=hist, xlab="", col="light blue") %>% map(.f=foo)

## $x
## $x[[1]]
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.711000 -0.637800 -0.000217  0.006543  0.671800  3.633000 
## 
## $x[[2]]
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001241 0.2518000 0.5012000 0.5028000 0.7566000 0.9999000 
## 
## $x[[3]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00001 0.29140 0.68340 0.98260 1.37900 8.46300 
## 
## 
## $main
## $main[[1]]
##    Length     Class      Mode 
##         1 character character 
## 
## $main[[2]]
##    Length     Class      Mode 
##         1 character character 
## 
## $main[[3]]
##    Length     Class      Mode 
##         1 character character
par(mfrow=c(1, 1))

Writing Robust Functions

There are two potentially desirable behaviors with functions:

  • Relaxed (default R approach) - make reasonable guesses about what you mean, which is particularly useful for interactive analyses
  • Robust (programming) - strict functions that throw errors rather than guessing in light of uncertainty

As a best practice, R functions that will be used for programming (as opposed to interactive command line work) should be written in a robust manner. Three standard problems should be avoided/mitigated:

  • Type-unstable - may return a vector one time, and a list the next
  • Non-standard evaluation - can use succinct API, but can introduce ambiguity
  • Hidden arguments - dependence on global functions/environments

There are several methods available for throwing errors within an R function:

  • stopifnot(expression) will stop and throw an error unless expression is TRUE
  • if (expression) { stop(“Error”, call.=FALSE) }
  • if (expression) { stop(" ‘x’ should be a character vector“, call.=FALSE) }
    • call.=FALSE means that the call to the function should not be shown (???) - Hadley recommends this

One example that commonly creates surprises is the [,] operator for extraction. Adding [ , , drop=FALSE] ensures that you will still have what you passed (e.g., a matrix or data frame) rather than conversion of a chunk of data to a vector.

Another common source of error is sapply() which will return a vector when it can and a list otherwise. The map() and map_typ() functions in purrr are designed to be type-stable; if the output is not as expected, they will error out.

Non-standard evaluations take advantage of the existence of something else (e.g., a variable in the parent environment that has not been passed). This can cause confusion and improper results.

  • subset(mtcars, disp > 400) takes advantage of disp being an element of mtcars; disp would crash if called outside subset
  • This can cause problems when it is embedded inside a function
  • ggplot and dplyr frequently have these behaviors also
    • The risk is that you can also put variables from the global environment in to the same call

Pure functions have the key properties that 1) their output depends only on their inputs, and 2) they do not impact the outside world other than by way of their return value. Specifically, the function should not depend on how the user has configured their global options as shown in options(), nor should it modify those options() settings upon return of control to the parent environment.

A few examples are shown below:

# Throwing errors to stop a function (cannot actually run these!)
# stopifnot(FALSE)
# if (FALSE) { stop("Error: ", call.=FALSE) }
# if (FALSE) { stop("Error: This condition needed to be set as TRUE", call.=FALSE) }

# Behavior of [,] and [,,drop=FALSE]
mtxTest <- matrix(data=1:9, nrow=3, byrow=TRUE)
class(mtxTest)
## [1] "matrix"
mtxTest[1, ]
## [1] 1 2 3
class(mtxTest[1, ])
## [1] "integer"
mtxTest[1, , drop=FALSE]
##      [,1] [,2] [,3]
## [1,]    1    2    3
class(mtxTest[1, , drop=FALSE])
## [1] "matrix"
# Behavior of sapply() - may not get what you are expecting
foo <- function(x) { x^2 }
sapply(1:5, FUN=foo)
## [1]  1  4  9 16 25
class(sapply(1:5, FUN=foo))
## [1] "numeric"
sapply(c(1, list(1.5, 2, 2.5), 3, 4, 5), FUN=foo)
## [1]  1.00  2.25  4.00  6.25  9.00 16.00 25.00
class(sapply(c(1, list(1.5, 2, 2.5), 3, 4, 5), FUN=foo))
## [1] "numeric"
sapply(list(1, c(1.5, 2, 2.5), 3, 4, 5), FUN=foo)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2.25 4.00 6.25
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
class(sapply(list(1, c(1.5, 2, 2.5), 3, 4, 5), FUN=foo))
## [1] "list"

This was a very enjoyable and instructive course.

Importing Data in to R

This course provides an overview of loading data in to R from five main sources:

  • Flat files
  • Excel files
  • Statistical software
  • Databases
  • Web data

Reading Flat Files

At the most basic level, the utlis library easily handles reading most types of flat files:

  • read.table(file, header=FALSE, sep=“”, stringsAsFactors=default.stringsAsFactors(), )
  • read.csv(file, header = TRUE, sep = “,”, quote = “"”, dec = “.”, fill = TRUE, comment.char = “”, …)
  • read.delim(file, header = TRUE, sep = “”, quote = “"”, dec = “.”, fill = TRUE, comment.char = “”, …)

There are also European equivalents in case the decimal needs to be set as “,” to read in the file:

  • read.csv2(file, header = TRUE, sep = “;”, quote = “"”, dec = “,”, fill = TRUE, comment.char = “”, …)
  • read.delim2(file, header = TRUE, sep = “”, quote = “"”, dec = “,”, fill = TRUE, comment.char = “”, …)

The file.path() command is a nice way to put together file paths. It is more or less equivalent to paste(, sep=“/”), but with the benefit that sep is machine/operating-system dependent, so it may be easier to use across platforms.

Further, there is the option to use colClasses() to specify the type in each column, with NULL meaning do not import. Abbreviations can be used for these as well:

# colClasses (relevant abbreviations)
R.utils::colClasses("-?cdfilnrzDP")
##  [1] "NULL"      "NA"        "character" "double"    "factor"   
##  [6] "integer"   "logical"   "numeric"   "raw"       "complex"  
## [11] "Date"      "POSIXct"
# file.path example
file.path("..", "myplot.pdf")
## [1] "../myplot.pdf"
# Key documentation for reading flat files
# 
# read.table(file, header = FALSE, sep = "", quote = "\"'",
#            dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
#            row.names, col.names, as.is = !stringsAsFactors,
#            na.strings = "NA", colClasses = NA, nrows = -1,
#            skip = 0, check.names = TRUE, fill = !blank.lines.skip,
#            strip.white = FALSE, blank.lines.skip = TRUE,
#            comment.char = "#",
#            allowEscapes = FALSE, flush = FALSE,
#            stringsAsFactors = default.stringsAsFactors(),
#            fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
# 
# read.csv(file, header = TRUE, sep = ",", quote = "\"",
#          dec = ".", fill = TRUE, comment.char = "", ...)
# 
# read.csv2(file, header = TRUE, sep = ";", quote = "\"",
#           dec = ",", fill = TRUE, comment.char = "", ...)
# 
# read.delim(file, header = TRUE, sep = "\t", quote = "\"",
#            dec = ".", fill = TRUE, comment.char = "", ...)
# 
# read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
#             dec = ",", fill = TRUE, comment.char = "", ...)

There are also two libraries that can be especially helpful for reading in flat files - readr and data.table.

  • readr::read_delim() handles many data types
  • readr::read_delim(file, delim=“,”) will read a CSV
    • assumes col_names=TRUE (eq to header=TRUE)
    • assumes col_types=NULL (imputed from first 100 rows, side effect - no need for stringAsFactors = FALSE)
  • col_types can use short type, where c=character, d=double (numeric), i=integer, l=logical (boolean), _=skip
    • col_names = FALSE means make your own
    • col_names = c() means here are the column names you should use
  • skip=
  • n_max=
  • read_csv() is for CSV
  • read_tsv is for tab-separated values

  • data.table() is designed for speed
    • data.table::fread() is for fast reading
    • The fread() automatically handles the column names and also infers the column separators
    • This is a faster, more convenients, and easier to customize version of read.table()
  • Wrappers for the readr() function
    • fac <- col_factor(levels = c(“Beef”, “Meat”, “Poultry”))
    • int <- col_integer()
    • hotdogsFactor <- read_tsv(“hotdogs.txt”,
      col_names = c(“type”, “calories”, “sodium”), col_types = list(fac, int, int) )

Reading Excel Files

Further, the library(readxl) is handy for loading Excel sheets:

  • readxl::excel_sheets() will list the sheets
    • excel_sheets(path)
  • readxl::read_excel() will read in a specific sheet
    • read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = “”, skip=0)
      • col_names: Either TRUE to use the first row as column names, FALSE to number columns sequentially from X1 to Xn, or a character vector giving a name for each column
      • col_types: Either NULL to guess from the spreadsheet or a character vector containing “blank”, “numeric”, “date” or “text”
  • lapply(excel_sheets(myXLS), FUN=read_excel, path=myXLS) provide all data from all sheets in a list

Reading Statistical Software Files

R can also load files from common statistical software such as SAS, STATA, SPSS, and MATLAB/Octave. The packages haven() by Wickham and foreign() by the R core team are two common examples. The R.matlab() allows for reading to/from MATLAB/Octave:

The library(haven) contains wrappers to the ReadStat package, a C library by Evan Miller, for reading files from SAS, STATA, and SPSS:

  • read_sas(filename)
  • read_stata(filename)
  • read_dta(filename)
    • as_factor(R_column) will help if the type is “labelled”
    • as.character(as_factor(R_column)) will turn it back to a character vector
  • read_spss(filename) which is a wrapper to read_por() and read_sav()

The library(foreign) can read/write all types of foreign formats, with some caveats:

  • Only SAS libraries (.xport) can be read in; seems quite a drawback to not be able to read SAS files!
  • read.dta(file, convert.factors = TRUE, convert.dates=TRUE, missing.type=FALSE)
  • read.spss(file, use.value.labels = TRUE, to.data.frame = FALSE)

Finally, the R.matlab() library is available for reading/writing MATLAB/Octave files. Per the help file:

  • Methods readMat() and writeMat() for reading and writing MAT files. For user with MATLAB v6 or newer installed (either locally or on a remote host), the package also provides methods for controlling MATLAB (trademark) via R and sending and retrieving data between R and MATLAB.

  • In brief, this package provides a one-directional interface from R to MATLAB, with communication taking place via a TCP/IP connection and with data transferred either through another connection or via the file system. On the MATLAB side, the TCP/IP connection is handled by a small Java add-on.

  • The methods for reading and writing MAT files are stable. The R to MATLAB interface, that is the Matlab class, is less prioritized and should be considered a beta version.

  • readMat(con, maxLength=NULL, fixNames=TRUE, drop=c(“singletonLists”), sparseMatrixClass=c(“Matrix”, “SparseM”, “matrix”), verbose=FALSE, …)
    • Returns a named list structure containing all variables in the MAT file structure.
  • writeMat(con, …, matVersion=“5”, onWrite=NULL, verbose=FALSE)
    • con: Binary connection to which the MAT file structure should be written to. A string is interpreted as filename, which then will be opened (and closed afterwards).
    • …: Named variables to be written where the names must be unique.

Reading Relational DB Files

Relational databases in R (DBMS tend to use SQL for queries), including libraries:

  • RMySQL
  • RPostgresSQL
  • ROracle
  • RSQLite

Conventions are specified in DBI; see library(DBI):

  • dbConnect(drv, …)
  • drv

Create the connection as “con” (or whatever) and then use that elsewhere:

  • dbListTables(con) # What tables are in this
  • dbReadTable(con, tablename)

When finished, dbDisconnect(con) as a courtesy so as to not tie up resources.

SQL queries from inside R - per previous, library(DBI) and then create the connection “con”:

  • dbGetQuery(con, valid_SQL_code)
    • Appears that " is an escape character to use quotes inside quotes
    • dbGetQuery(con, “SELECT name FROM employees WHERE started_at > "2012-09-01"”)
    • dbGetQuery(con, “SELECT * FROM products WHERE contract=1”)
  • res <- dbSendQuery(con, validSQLcode) # If instead you want to grab only chunks of records
    • dbFetch(res) # Can specify chunking, such as dbFetch(res, n=1) for one line at a time
    • dbHasCompleted(res) # see whether it is done
      • while(!dbHasCompleted(res)) { chunk <- dbFetch(res, n=1); print(chunk) }
    • dbClearResult(res)

For example, using “./SQLforDataCampRMD_v01.db”, run a few SQL commands:

# uses libraries DBI for the connection and RSQLite to interface with SQLite Browser on my machine
con <- DBI::dbConnect(RSQLite::SQLite(), "SQLforDataCampRMD_v01.db")

# List the tables, and drop dummy if it already exists
DBI::dbListTables(con)
## [1] "dummy"
DBI::dbGetQuery(con, "DROP TABLE IF EXISTS dummy")

# Create blank table
DBI::dbListTables(con)
## character(0)
DBI::dbGetQuery(con, "CREATE TABLE IF NOT EXISTS dummy (id PRIMARY KEY, name CHAR)")
DBI::dbGetQuery(con, "INSERT OR IGNORE INTO dummy (id, name) VALUES (1, 'Amy')")
DBI::dbGetQuery(con, "INSERT OR IGNORE INTO dummy (id, name) VALUES (2, 'Bill')")
DBI::dbGetQuery(con, "INSERT OR IGNORE INTO dummy (id, name) VALUES (2, 'Jen')") # Should do nothing
DBI::dbGetQuery(con, "SELECT * FROM dummy")
##   id name
## 1  1  Amy
## 2  2 Bill
DBI::dbListTables(con)
## [1] "dummy"
# Can continue passing SQL commands back and forth as needed

# Close the connection
DBI::dbDisconnect(con)
## [1] TRUE

Reading Web Data

Many of the R read-in libraries already work well with web data. For example, read.csv(“mywebsite.com”, stringAsFactors=FALSE) will read a CSV right off the internet. Further, there are options for:

  • download.file(url, dest_path) # Reproducibility advantages over right-click and save
  • library(httr) by Hadley Wickham, including GET() and content()

The jsonlite library is good for working with JSON:

  • fromJSON(url) will create a named R list (often creates a data frame also)
    • JSON objects are name:value pairs
    • JSON arrays convert to vectors
    • JSON can also created nested arrays
  • toJSON()

Prettify adds indentation to a JSON string; minify removes all indentation/whitespace:

  • prettify() prettify(txt, indent = 4)
  • minify() minify(txt)
jsonLoc <- file.path("../../..", "PythonDirectory", "UMModule04", "roster_data.json")
jsonData <- jsonlite::fromJSON(jsonLoc)
str(jsonData)
##  chr [1:379, 1:3] "Calvin" "Wilson" "Emi" "Rosina" "Sylvie" ...
head(jsonData)
##      [,1]        [,2]    [,3]
## [1,] "Calvin"    "si110" "1" 
## [2,] "Wilson"    "si110" "0" 
## [3,] "Emi"       "si110" "0" 
## [4,] "Rosina"    "si110" "0" 
## [5,] "Sylvie"    "si110" "0" 
## [6,] "Katarzyna" "si110" "0"

Cleaning Data in R

The general analysis pipeline is Collect -> Clean -> Analyze -> Report. Cleaning is needed so the raw data can work with more traditional tools (e.g., packages in Python or R). 50% - 80% of time is spent in the Collect/Clean realm, even though this is not the most exciting (and thus taught) part of data analysis. There are generally three stages of data cleaning: Explore -> Tidy -> Prepare

Exploring the Data:

  • class()
  • dim()
  • names() # column names
  • str()
  • dplyr::glimpse() # a nicer version of str()
  • summary()

Viewing the Data:

  • head(file, n=6)
  • tail(file, n=6)
  • hist(variable)
  • plot(x, y)
  • print() # not recommended for larger datasets

Tidy data - Wickham 2014, Principles of Tidy Data:

  • Each row should be an observation
  • Each column should be a variable (attribute)
  • Column headers should not be variables; e.g., eye color should be a single column, not many columns of 0/1
  • Each intersection should be a value (intersection of the observation and attribute)
  • Only one type of observational unit (e.g., each row is a person) per table

The principles of tidy data can be implemented using library(tidyr):

  • tidyr::gather(data, key, value, …, na.rm = FALSE, convert = FALSE, factor_key = FALSE) # gather the data in to key-value pairs
    • key, value Names of key and value columns to create in output.
    • … Specification of columns to gather. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
  • tidyr::spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE) # spread the key-value pairs in to columns
    • key The bare (unquoted) name of the column whose values will be used as column headings.
    • value The bare (unquoted) name of the column whose values will populate the cells.
    • fill If set, missing values will be replaced with this value. Note that there are two types of missing in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren’t present. Both types of missing value will be replaced by fill.
  • tidyr::separate(data, col, into, sep = “[^[:alnum:]]+”, remove = TRUE, convert = FALSE, extra = “warn”, fill = “warn”, …)
    • col Bare column name.
    • into Names of new variables to create as character vector.
    • sep Separator between columns. If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values. If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.
  • tidyr::unite(data, col, …, sep = “_“, remove = TRUE)
    • col (Bare) name of column to add
    • … Specification of columns to unite. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
    • sep Separator to use between values.

Common symptoms of messy data include:

  • Column headers are values rather than variable names – use tidyr::gather()
  • Variables stored in both rows and columns – use tidyr::spread()
  • Multiple variables are stored in a single column – use tidyr::separate()
  • Singe type of observational unit (e.g., people) stored in 2+ tables
  • Multiple types of observational units (e.g., people and pets) stored in a single table

Example code includes:

# tidyr::gather()
stocks <- data.frame(time = as.Date('2009-01-01') + 0:4, 
                     X = rnorm(5, 0, 1), Y = rnorm(5, 0, 2), Z = rnorm(5, 0, 4)
                     )
stocks
##         time           X          Y         Z
## 1 2009-01-01  1.64736472 -0.1020457 -8.074672
## 2 2009-01-02  0.32981671 -0.2377234  7.617473
## 3 2009-01-03  0.05010405 -0.7091054 -9.770047
## 4 2009-01-04  0.41187479  1.1899260 -1.655071
## 5 2009-01-05 -2.20625659 -1.1299452  1.615068
# will create new columns stock (each of X, Y, Z) and price (the values that had been in X, Y, and Z), 
# while not gathering the time variable; final table will be time-stock-price
stockGather <- tidyr::gather(stocks, stock, price, -time)  
stockGather
##          time stock       price
## 1  2009-01-01     X  1.64736472
## 2  2009-01-02     X  0.32981671
## 3  2009-01-03     X  0.05010405
## 4  2009-01-04     X  0.41187479
## 5  2009-01-05     X -2.20625659
## 6  2009-01-01     Y -0.10204566
## 7  2009-01-02     Y -0.23772338
## 8  2009-01-03     Y -0.70910541
## 9  2009-01-04     Y  1.18992602
## 10 2009-01-05     Y -1.12994523
## 11 2009-01-01     Z -8.07467238
## 12 2009-01-02     Z  7.61747283
## 13 2009-01-03     Z -9.77004663
## 14 2009-01-04     Z -1.65507149
## 15 2009-01-05     Z  1.61506772
# tidyr::spread()
tidyr::spread(stockGather, stock, price)
##         time           X          Y         Z
## 1 2009-01-01  1.64736472 -0.1020457 -8.074672
## 2 2009-01-02  0.32981671 -0.2377234  7.617473
## 3 2009-01-03  0.05010405 -0.7091054 -9.770047
## 4 2009-01-04  0.41187479  1.1899260 -1.655071
## 5 2009-01-05 -2.20625659 -1.1299452  1.615068
# TRUE (this fully reverses what the gather function has done)
identical(tidyr::spread(stockGather, stock, price), stocks)  
## [1] TRUE
# tidyr::separate()
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
##      x
## 1 <NA>
## 2  a.b
## 3  a.d
## 4  b.c
# by default, the splits occur on anything that is not alphanumeric, 
# so you get column A as whatever is before the dot and column B as whatever is after the dot
dfSep <- tidyr::separate(df, x, c("A", "B"))
dfSep
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
# tidyr::unite()
tidyr::unite(dfSep, united, c(A, B), sep="")
##   united
## 1   NANA
## 2     ab
## 3     ad
## 4     bc
is.na(dfSep) # caution . . . 
##       A     B
## 1  TRUE  TRUE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
is.na(tidyr::unite(dfSep, united, c(A, B), sep="")) # caution . . . 
##   united
## 1  FALSE
## 2  FALSE
## 3  FALSE
## 4  FALSE
  • The library(lubridate) can be helpful for coercing strings to dates
    • ymd() # coerces a character string that is in year-month-day originally
    • mdy() # coerces a character string that is in month-day-year originally
    • hms() # coerces a character string that is in hours-minutes-seconds originally
    • ymd_hms() # coerces a character string that is in year-month-day-hours-minutes-seconds
  • The library(stringr) can be helpful for working with strings
    • stringr::str_trim() # trails the leading and trailing white space
    • stringr::str_pad(char, width, side, pad) # adds characters in place of white space at the start/end
    • stringr::str_detect(inVector, searchTerm) # is searchTerm in each iterm of vector (boolean, same length as inVector)
    • stringr::str_replace(inVector, searchTerm, replaceTerm) # searchTerm will be replaced by replaceTerm in each iterm of vector (output same length/type as inVector)
      • social_df\(status <- str_replace(social_df\)status, “^$”, NA) # Nice way to make the blanks in to NA
  • The tolower() and toupper() commands can be very useful also

  • Missing and special values
    • May be randomly missing, but very dangerous to assume!
    • In R, these are NA
      • Excel may have #N/A
      • SPSS and SAS may have .
      • Sometimes just shows up as a missing string
    • Inf is for infinite
    • NaN is for not a number
  • Finding these special values
    • is.na()
    • any(is.na())
    • sum(is.na())
    • summary() # will tell the number of NA
    • complete.cases() # TRUE is the entire row is non-NA; FALSE otherwise
    • na.omit() # equivalent to x[complete.cases(x), ]

Example code includes:

# lubridate::ymd()
lubridate::ymd("160720")
## [1] "2016-07-20 UTC"
lubridate::ymd("2016-7-20")
## [1] "2016-07-20 UTC"
lubridate::ymd("16jul20")
## [1] "2016-07-20 UTC"
lubridate::ymd("16/07/20")
## [1] "2016-07-20 UTC"
# lubridate::hms()
lubridate::hms("07h15:00")
## [1] "7H 15M 0S"
lubridate::hms("17 hours, 15 minutes 00 seconds")
## [1] "17H 15M 0S"
lubridate::hms("07-15-00")
## [1] "7H 15M 0S"
# From ?stringr::str_detect
# 
# str_detect(string, pattern)
#   string  Input vector. Either a character vector, or something coercible to one.
#   pattern Pattern to look for.  The default interpretation is a regular expression, as described in stringi-search-regex. Control options with regex().  Match a fixed string (i.e. by comparing only bytes), using fixed(x). This is fast, but approximate. Generally, for matching human text, you'll want coll(x) which respects character matching rules for the specified locale.  Match character, word, line and sentence boundaries with boundary(). An empty pattern, "", is equivalent to boundary("character").
# 

fruit <- c("apple", "banana", "pear", "pinapple")

stringr::str_detect(fruit, "a")
## [1] TRUE TRUE TRUE TRUE
stringr::str_detect(fruit, "^a")
## [1]  TRUE FALSE FALSE FALSE
stringr::str_detect(fruit, "a$")
## [1] FALSE  TRUE FALSE FALSE
stringr::str_detect(fruit, "b")
## [1] FALSE  TRUE FALSE FALSE
stringr::str_detect(fruit, "[aeiou]")
## [1] TRUE TRUE TRUE TRUE
# Also vectorised over pattern
stringr::str_detect("aecfg", letters)
##  [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE
# From ?stringr::str_replace
#
# str_replace(string, pattern, replacement)
# str_replace_all(string, pattern, replacement)
#   string  Input vector. Either a character vector, or something coercible to one.
#   pattern, replacement    Supply separate pattern and replacement strings to vectorise over the patterns. References of the form \1, \2 will be replaced with the contents of the respective matched group (created by ()) within the pattern.  For str_replace_all only, you can perform multiple patterns and replacements to each string, by passing a named character to pattern.
#

someNA <- c(letters, "", LETTERS, "")
someNA[someNA==""] <- NA
someNA
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z" NA  "A" "B" "C" "D" "E" "F" "G"
## [35] "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
## [52] "Y" "Z" NA
fruits <- c("one apple", "two pears", "three bananas")
stringr::str_replace(fruits, "[aeiou]", "-")  # Replace FIRST instance
## [1] "-ne apple"     "tw- pears"     "thr-e bananas"
stringr::str_replace_all(fruits, "[aeiou]", "-")  # Replace ALL instances
## [1] "-n- -ppl-"     "tw- p--rs"     "thr-- b-n-n-s"
stringr::str_replace(fruits, "([aeiou])", "\\1\\1\\1")  # Triple up on the first vowel
## [1] "ooone apple"     "twooo pears"     "threeee bananas"
stringr::str_replace(fruits, "[aeiou]", c("1", "2", "3"))  # First vowel to 1, 2, 3 in word 1, 2, 3
## [1] "1ne apple"     "tw2 pears"     "thr3e bananas"
stringr::str_replace(fruits, c("a", "e", "i"), "-")  # First a -> - in word 1, first e -> - in word 2 . . . 
## [1] "one -pple"     "two p-ars"     "three bananas"
stringr::str_replace_all(fruits, "([aeiou])", "\\1\\1")  # Double up on all vowels
## [1] "oonee aapplee"      "twoo peeaars"       "threeee baanaanaas"
stringr::str_replace_all(fruits, "[aeiou]", c("1", "2", "3"))  # All vowels to 1, 2, 3, in word 1, 2, 3
## [1] "1n1 1ppl1"     "tw2 p22rs"     "thr33 b3n3n3s"
stringr::str_replace_all(fruits, c("a", "e", "i"), "-")  # All a -> - in word 1, . . . 
## [1] "one -pple"     "two p-ars"     "three bananas"

Further, the outline from the weather gathering data cleaning challenge is noted:

  • The weather data tidying challenge
    • Historical weather data from Boston
      • 12 months beginning December 2014
      • Columns are values (X1 means day 1, X2 means day 2, etc.), while measure (Max, Min, etc.) should be variables
      • Variables coded incorrectly
      • Missing and extreme values
      • Etc.
  • STEP 1: UNDERSTAND STRUCTURE
    • class(), dim(), names(), str(), dplyr::glimpse(), summary()
  • STEP 2: LOOK AT DATA
    • head(), tail(), print(), hist(), plot()
  • STEP 3: TIDY DATA
    • gather(), spread(), separate()
  • STEP 4: CONVERT TYPES
    • lubridate() and variants, as.character() and variants, stringr() and variants, tidyr::unite()
  • STEP 5: MANAGE MISSING and EXTREME (OUTLIER) VALUES
    • is.na(), sum(is.na()), any(is.na()), which(is.na())
    • which(a, arr.ind=TRUE) returns a little matrix of rows and columns - nice!

Data Manipulation (dplyr)

The library(dplyr) is a grammar of data manipulation. It is written in C++ so you get the speed of C with the convenience of R. It is in essence the data frame to data frame portion of plyr (plyr was the original Split-Apply-Combine). May want to look in to count, transmute, and other verbs added post this summary.

The examples use data(hflights) from library(hflights):

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:purrr':
## 
##     order_by
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(hflights)
data(hflights)
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 

The “tbl” is a special type of data frame, which is very helpful for printing:

  • tbl_df(myFrame) # can store or whatever - will be a tbl_df, tbl, and data.frame
    • Display is modified to fit the window display - will scale with the window
  • glimpse(myFrame) # lets you see al the variables and first few records for each (sort of like str)
  • as.data.frame(tbl_df(myFrame)) # this will be the data frame
    • identical(as.data.frame(tbl_df(hflights)), hflights) # FALSE
    • sum(is.na(as.data.frame(tbl_df(hflights))) != is.na(hflights)) # 0
    • sum(as.data.frame(tbl_df(hflights)) != hflights, na.rm=TRUE) # 0

An interesting way to do a lookup table:

  • two <- c(“AA”, “AS”)
  • lut <- c(“AA” = “American”, “AS” = “Alaska”, “B6” = “JetBlue”)
  • two <- lut[two]
  • two

See for example:

lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa"
         )
hflights$Carrier <- lut[hflights$UniqueCarrier]  
glimpse(hflights)  
## Observations: 227,496
## Variables: 22
## $ Year              (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     (chr) "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  (chr) "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Carrier           (chr) "American", "American", "American", "America...

There are five main verbs in dplyr:

  • select - subset of columns from a dataset
    • select(df, . . . ) where . . . Are the columns to be kept
    • starts_with(“X”): every name that starts with “X”,
    • ends_with(“X”): every name that ends with “X”,
    • contains(“X”): every name that contains “X”,
    • matches(“X”): every name that matches “X”, where “X” can be a regular expression,
    • num_range(“x”, 1:5): the variables named x01, x02, x03, x04 and x05,
    • one_of(x): every name that appears in x, which should be a character vector.
    • filter - subset of rows from a dataset
      • filter(df, .) where … are 1+ logical tests (so make sure to use == or all.equal() or the like)
  • arrange - reorder rows in a dataset
    • arrange(df, .) where . are the colunns to reorder by
  • mutate - create new columns in a dataset
    • mutate(df, .) where each . is a formula for a new variable to be created
  • summarize - create summary statistics for a dataset
    • summarize(df, .) where each . is a formula like newVar = thisEquation
      • only aggregate functions (vector as input, single number as output) should be used
    • dplyr adds several additional aggregate functions such as first, last, nth, n, n_distinct
      • first(x) - The first element of vector x.
      • last(x) - The last element of vector x.
      • nth(x, n) - The nth element of vector x.
      • n() - The number of rows in the data.frame or group of observations that summarise() describes.
      • n_distinct(x) - The number of unique values in vector x.
  • In general:
    • select and mutate operate on the variables
    • filter and arrange operate on the observations
    • summarize operates on groups of observations
    • All of these are much cleaner if the data are tidy
  • There is also the option to use chaining %>% to process multiple commands
    • Especially useful for memory storage and readability
    • The pipe operator (%>%) comes from the magrittr package by Stefan Bache
    • object %>% function(object will go first)
    • c(1, 2, 3) %>% sum() # 6
    • c(1, 2, 3, NA) %>% mean(na.rm=TRUE) # 2

There is also the group_by capability for summaries of sub-groups:

  • group_by(df, .) where the . is what to group the data by
    • The magic is when you run summarize() on data with group_by run on it; results will be by group
    • If you have group_by(df, a, b) %>% summarize(timeSum = sum(time)) # all observations by a-b
    • If you have group_by(df, a, b) %>% summarize(timeSum = sum(time)) %>% summarize(timeA = sum(timeSum)) # all observations by a
    • If you have group_by(df, a, b) %>% summarize(timeSum = sum(time)) %>% summarize(timeA = sum(timeSum)) %>% summarize(timeAll = sum(timeA)) # all observations

The dplyr library can also work with databases. It only loads the data that you need, and you do not need to know the relevant SQL code – dplyr writes the SQL code for you.

Basic select and mutate examples include:

data(hflights)

# Make it faster, as well as a prettier printer
hflights <- tbl_df(hflights)
hflights
## Source: local data frame [227,496 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
class(hflights)
## [1] "tbl_df"     "tbl"        "data.frame"
# Select examples
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)
## Source: local data frame [227,496 x 4]
## 
##    ActualElapsedTime AirTime ArrDelay DepDelay
##                (int)   (int)    (int)    (int)
## 1                 60      40      -10        0
## 2                 60      45       -9        1
## 3                 70      48       -8       -8
## 4                 70      39        3        3
## 5                 62      44       -3        5
## 6                 64      45       -7       -1
## 7                 70      43       -1       -1
## 8                 59      40      -16       -5
## 9                 71      41       44       43
## 10                70      45       43       43
## ..               ...     ...      ...      ...
select(hflights, Origin:Cancelled)
## Source: local data frame [227,496 x 6]
## 
##    Origin  Dest Distance TaxiIn TaxiOut Cancelled
##     (chr) (chr)    (int)  (int)   (int)     (int)
## 1     IAH   DFW      224      7      13         0
## 2     IAH   DFW      224      6       9         0
## 3     IAH   DFW      224      5      17         0
## 4     IAH   DFW      224      9      22         0
## 5     IAH   DFW      224      9       9         0
## 6     IAH   DFW      224      6      13         0
## 7     IAH   DFW      224     12      15         0
## 8     IAH   DFW      224      7      12         0
## 9     IAH   DFW      224      8      22         0
## 10    IAH   DFW      224      6      19         0
## ..    ...   ...      ...    ...     ...       ...
select(hflights, Year:DayOfWeek, ArrDelay:Diverted)
## Source: local data frame [227,496 x 14]
## 
##     Year Month DayofMonth DayOfWeek ArrDelay DepDelay Origin  Dest
##    (int) (int)      (int)     (int)    (int)    (int)  (chr) (chr)
## 1   2011     1          1         6      -10        0    IAH   DFW
## 2   2011     1          2         7       -9        1    IAH   DFW
## 3   2011     1          3         1       -8       -8    IAH   DFW
## 4   2011     1          4         2        3        3    IAH   DFW
## 5   2011     1          5         3       -3        5    IAH   DFW
## 6   2011     1          6         4       -7       -1    IAH   DFW
## 7   2011     1          7         5       -1       -1    IAH   DFW
## 8   2011     1          8         6      -16       -5    IAH   DFW
## 9   2011     1          9         7       44       43    IAH   DFW
## 10  2011     1         10         1       43       43    IAH   DFW
## ..   ...   ...        ...       ...      ...      ...    ...   ...
## Variables not shown: Distance (int), TaxiIn (int), TaxiOut (int),
##   Cancelled (int), CancellationCode (chr), Diverted (int)
select(hflights, ends_with("Delay"))
## Source: local data frame [227,496 x 2]
## 
##    ArrDelay DepDelay
##       (int)    (int)
## 1       -10        0
## 2        -9        1
## 3        -8       -8
## 4         3        3
## 5        -3        5
## 6        -7       -1
## 7        -1       -1
## 8       -16       -5
## 9        44       43
## 10       43       43
## ..      ...      ...
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
## Source: local data frame [227,496 x 5]
## 
##    UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##            (chr)     (int)   (chr)     (int)            (chr)
## 1             AA       428  N576AA         0                 
## 2             AA       428  N557AA         0                 
## 3             AA       428  N541AA         0                 
## 4             AA       428  N403AA         0                 
## 5             AA       428  N492AA         0                 
## 6             AA       428  N262AA         0                 
## 7             AA       428  N493AA         0                 
## 8             AA       428  N477AA         0                 
## 9             AA       428  N476AA         0                 
## 10            AA       428  N504AA         0                 
## ..           ...       ...     ...       ...              ...
select(hflights, ends_with("Time"), ends_with("Delay"))
## Source: local data frame [227,496 x 6]
## 
##    DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
##      (int)   (int)             (int)   (int)    (int)    (int)
## 1     1400    1500                60      40      -10        0
## 2     1401    1501                60      45       -9        1
## 3     1352    1502                70      48       -8       -8
## 4     1403    1513                70      39        3        3
## 5     1405    1507                62      44       -3        5
## 6     1359    1503                64      45       -7       -1
## 7     1359    1509                70      43       -1       -1
## 8     1355    1454                59      40      -16       -5
## 9     1443    1554                71      41       44       43
## 10    1443    1553                70      45       43       43
## ..     ...     ...               ...     ...      ...      ...
# Mutate example
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)
class(m1)
## [1] "tbl_df"     "tbl"        "data.frame"
m1
## Source: local data frame [227,496 x 23]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int), loss (int), loss_ratio (dbl)
glimpse(m1)
## Observations: 227,496
## Variables: 23
## $ Year              (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     (chr) "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  (chr) "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ loss              (int) -10, -10, 0, 0, -8, -6, 0, -11, 1, 0, 0, -14...
## $ loss_ratio        (dbl) -Inf, -10.00000000, 0.00000000, 0.00000000, ...

Additionally, examples for filter and arrange:

# Examples for filter

filter(hflights, Distance >= 3000)  # All flights that traveled 3000 miles or more
## Source: local data frame [527 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1         31         1     924    1413            CO
## 2   2011     1         30         7     925    1410            CO
## 3   2011     1         29         6    1045    1445            CO
## 4   2011     1         28         5    1516    1916            CO
## 5   2011     1         27         4     950    1344            CO
## 6   2011     1         26         3     944    1350            CO
## 7   2011     1         25         2     924    1337            CO
## 8   2011     1         24         1    1144    1605            CO
## 9   2011     1         23         7     926    1335            CO
## 10  2011     1         22         6     942    1340            CO
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
filter(hflights, UniqueCarrier %in% c("B6", "WN", "DL"))
## Source: local data frame [48,679 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6     654    1124            B6
## 2   2011     1          1         6    1639    2110            B6
## 3   2011     1          2         7     703    1113            B6
## 4   2011     1          2         7    1604    2040            B6
## 5   2011     1          3         1     659    1100            B6
## 6   2011     1          3         1    1801    2200            B6
## 7   2011     1          4         2     654    1103            B6
## 8   2011     1          4         2    1608    2034            B6
## 9   2011     1          5         3     700    1103            B6
## 10  2011     1          5         3    1544    1954            B6
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
filter(hflights, (TaxiIn + TaxiOut) > AirTime)  # Flights where taxiing took longer than flying
## Source: local data frame [1,389 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1         24         1     731     904            AA
## 2   2011     1         30         7    1959    2132            AA
## 3   2011     1         24         1    1621    1749            AA
## 4   2011     1         10         1     941    1113            AA
## 5   2011     1         31         1    1301    1356            CO
## 6   2011     1         31         1    2113    2215            CO
## 7   2011     1         31         1    1434    1539            CO
## 8   2011     1         31         1     900    1006            CO
## 9   2011     1         30         7    1304    1408            CO
## 10  2011     1         30         7    2004    2128            CO
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
filter(hflights, DepTime < 500 | ArrTime > 2200)  # Flights departed before 5am or arrived after 10pm
## Source: local data frame [27,799 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          4         2    2100    2207            AA
## 2   2011     1         14         5    2119    2229            AA
## 3   2011     1         10         1    1934    2235            AA
## 4   2011     1         26         3    1905    2211            AA
## 5   2011     1         30         7    1856    2209            AA
## 6   2011     1          9         7    1938    2228            AS
## 7   2011     1         31         1    1919    2231            CO
## 8   2011     1         31         1    2116    2344            CO
## 9   2011     1         31         1    1850    2211            CO
## 10  2011     1         31         1    2102    2216            CO
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
filter(hflights, DepDelay > 0, ArrDelay < 0)  # Flights that departed late but arrived ahead of schedule
## Source: local data frame [27,712 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          2         7    1401    1501            AA
## 2   2011     1          5         3    1405    1507            AA
## 3   2011     1         18         2    1408    1508            AA
## 4   2011     1         18         2     721     827            AA
## 5   2011     1         12         3    2015    2113            AA
## 6   2011     1         13         4    2020    2116            AA
## 7   2011     1         26         3    2009    2103            AA
## 8   2011     1          1         6    1631    1736            AA
## 9   2011     1         10         1    1639    1740            AA
## 10  2011     1         12         3    1631    1739            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
filter(hflights, Cancelled == 1, DepDelay > 0) # Flights that were cancelled after being delayed
## Source: local data frame [40 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1         26         3    1926      NA            CO
## 2   2011     1         11         2    1100      NA            US
## 3   2011     1         19         3    1811      NA            XE
## 4   2011     1          7         5    2028      NA            XE
## 5   2011     2          4         5    1638      NA            AA
## 6   2011     2          8         2    1057      NA            CO
## 7   2011     2          2         3     802      NA            XE
## 8   2011     2          9         3     904      NA            XE
## 9   2011     2          1         2    1508      NA            OO
## 10  2011     3         31         4    1016      NA            CO
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
c1 <- filter(hflights, Dest == "JFK")  # Flights that had JFK as their destination: c1
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep="-"))  # Create a Date column: c2
select(c2, Date, DepTime, ArrTime, TailNum)  # Print out a selection of columns of c2
## Source: local data frame [695 x 4]
## 
##        Date DepTime ArrTime TailNum
##       (chr)   (int)   (int)   (chr)
## 1  2011-1-1     654    1124  N324JB
## 2  2011-1-1    1639    2110  N324JB
## 3  2011-1-2     703    1113  N324JB
## 4  2011-1-2    1604    2040  N324JB
## 5  2011-1-3     659    1100  N229JB
## 6  2011-1-3    1801    2200  N206JB
## 7  2011-1-4     654    1103  N267JB
## 8  2011-1-4    1608    2034  N267JB
## 9  2011-1-5     700    1103  N708JB
## 10 2011-1-5    1544    1954  N644JB
## ..      ...     ...     ...     ...
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))  # Definition of dtc


# Examples for arrange

arrange(dtc, DepDelay)  # Arrange dtc by departure delays
## Source: local data frame [68 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     7         23         6     605      NA            F9
## 2   2011     1         17         1     916      NA            XE
## 3   2011    12          1         4     541      NA            US
## 4   2011    10         12         3    2022      NA            MQ
## 5   2011     7         29         5    1424      NA            CO
## 6   2011     9         29         4    1639      NA            OO
## 7   2011     2          9         3     555      NA            MQ
## 8   2011     5          9         1     715      NA            OO
## 9   2011     1         20         4    1413      NA            UA
## 10  2011     1         17         1     831      NA            WN
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
arrange(dtc, CancellationCode)  # Arrange dtc so that cancellation reasons are grouped
## Source: local data frame [68 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1         20         4    1413      NA            UA
## 2   2011     1          7         5    2028      NA            XE
## 3   2011     2          4         5    1638      NA            AA
## 4   2011     2          8         2    1057      NA            CO
## 5   2011     2          1         2    1508      NA            OO
## 6   2011     2         21         1    2257      NA            OO
## 7   2011     2          9         3     555      NA            MQ
## 8   2011     3         18         5     727      NA            UA
## 9   2011     4          4         1    1632      NA            DL
## 10  2011     4          8         5    1608      NA            WN
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
arrange(dtc, UniqueCarrier, DepDelay)  # Arrange dtc according to carrier and departure delays
## Source: local data frame [68 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     8         18         4    1808      NA            AA
## 2   2011     2          4         5    1638      NA            AA
## 3   2011     7         29         5    1424      NA            CO
## 4   2011     1         26         3    1703      NA            CO
## 5   2011     8         11         4    1320      NA            CO
## 6   2011     7         25         1    1654      NA            CO
## 7   2011     1         26         3    1926      NA            CO
## 8   2011     3         31         4    1016      NA            CO
## 9   2011     2          8         2    1057      NA            CO
## 10  2011     4          4         1    1632      NA            DL
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
arrange(hflights, UniqueCarrier, desc(DepDelay))  # Arrange by carrier and decreasing departure delays
## Source: local data frame [227,496 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011    12         12         1     650     808            AA
## 2   2011    11         19         6    1752    1910            AA
## 3   2011    12         22         4    1728    1848            AA
## 4   2011    10         23         7    2305       2            AA
## 5   2011     9         27         2    1206    1300            AA
## 6   2011     3         17         4    1647    1747            AA
## 7   2011     6         21         2     955    1315            AA
## 8   2011     5         20         5    2359     130            AA
## 9   2011     4         19         2    2023    2142            AA
## 10  2011     5         12         4    2133      53            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
arrange(hflights, DepDelay + ArrDelay)  # Arrange flights by total delay (normal order)
## Source: local data frame [227,496 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     7          3         7    1914    2039            XE
## 2   2011     8         31         3     934    1039            OO
## 3   2011     8         21         7     935    1039            OO
## 4   2011     8         28         7    2059    2206            OO
## 5   2011     8         29         1     935    1041            OO
## 6   2011    12         25         7     741     926            OO
## 7   2011     1         30         7     620     812            OO
## 8   2011     8          3         3    1741    1810            XE
## 9   2011     8          4         4     930    1041            OO
## 10  2011     8         18         4     939    1043            OO
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Additionally, examples for the summarize verb:

# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))
## Source: local data frame [1 x 2]
## 
##   min_dist max_dist
##      (int)    (int)
## 1       79     3904
# Print out a summary with variable max_div
summarize(filter(hflights, Diverted == 1), max_div = max(Distance))
## Source: local data frame [1 x 1]
## 
##   max_div
##     (int)
## 1    3904
# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarize(temp1, earliest=min(ArrDelay), average=mean(ArrDelay), latest=max(ArrDelay), sd=sd(ArrDelay))
## Source: local data frame [1 x 4]
## 
##   earliest  average latest       sd
##      (int)    (dbl)  (int)    (dbl)
## 1      -70 7.094334    978 30.70852
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
## Source: local data frame [1 x 1]
## 
##   max_taxi_diff
##           (int)
## 1           160
# Generate summarizing statistics for hflights
summarize(hflights, n_obs = n(), n_carrier = n_distinct(UniqueCarrier), n_dest = n_distinct(Dest))
## Source: local data frame [1 x 3]
## 
##    n_obs n_carrier n_dest
##    (int)     (int)  (int)
## 1 227496        15    116
# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "AA")

# Generate summarizing statistics for aa 
summarize(aa, n_flights = n(), n_canc = sum(Cancelled), avg_delay = mean(ArrDelay, na.rm=TRUE))
## Source: local data frame [1 x 3]
## 
##   n_flights n_canc avg_delay
##       (int)  (int)     (dbl)
## 1      3244     60 0.8917558

Additionally, examples for the pipe/chain as per magrittr:

# Find the average delta in taxi times
hflights %>%
    mutate(diff = (TaxiOut - TaxiIn)) %>%
    filter(!is.na(diff)) %>%
    summarize(avg = mean(diff))
## Source: local data frame [1 x 1]
## 
##        avg
##      (dbl)
## 1 8.992064
# Find flights that average less than 70 mph assuming 100 wasted minutes per flight
hflights %>%
    mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance / RealTime) %>%
    filter(!is.na(mph), mph < 70) %>%
    summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
## Source: local data frame [1 x 4]
## 
##   n_less n_dest min_dist max_dist
##    (int)  (int)    (int)    (int)
## 1   6726     13       79      305
# Find flights that average less than 105 mph, or that are diverted/cancelled
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarize(n_non = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
## Source: local data frame [1 x 4]
## 
##   n_non n_dest min_dist max_dist
##   (int)  (int)    (int)    (int)
## 1 42400    113       79     3904
# Find overnight flights
filter(hflights, !is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
    summarize(num = n())
## Source: local data frame [1 x 1]
## 
##     num
##   (int)
## 1  2718

There is also the group_by capability, typically for use with summarize:

# Make an ordered per-carrier summary of hflights
group_by(hflights, UniqueCarrier) %>%
    summarize(p_canc = 100 * mean(Cancelled, na.rm=TRUE), avg_delay = mean(ArrDelay, na.rm=TRUE)) %>%
    arrange(avg_delay, p_canc)
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier    p_canc  avg_delay
##            (chr)     (dbl)      (dbl)
## 1             US 1.1268986 -0.6307692
## 2             AA 1.8495684  0.8917558
## 3             FL 0.9817672  1.8536239
## 4             AS 0.0000000  3.1923077
## 5             YV 1.2658228  4.0128205
## 6             DL 1.5903067  6.0841374
## 7             CO 0.6782614  6.0986983
## 8             MQ 2.9044750  7.1529751
## 9             EV 3.4482759  7.2569543
## 10            WN 1.5504047  7.5871430
## 11            F9 0.7159905  7.6682692
## 12            XE 1.5495599  8.1865242
## 13            OO 1.3946828  8.6934922
## 14            B6 2.5899281  9.8588410
## 15            UA 1.6409266 10.4628628
# Ordered overview of average arrival delays per carrier
hflights %>%
    filter(!is.na(ArrDelay), ArrDelay > 0) %>%
    group_by(UniqueCarrier) %>%
    summarize(avg = mean(ArrDelay)) %>%
    mutate(rank = rank(avg)) %>%
    arrange(rank)
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier      avg  rank
##            (chr)    (dbl) (dbl)
## 1             YV 18.67568     1
## 2             F9 18.68683     2
## 3             US 20.70235     3
## 4             CO 22.13374     4
## 5             AS 22.91195     5
## 6             OO 24.14663     6
## 7             XE 24.19337     7
## 8             WN 25.27750     8
## 9             FL 27.85693     9
## 10            AA 28.49740    10
## 11            DL 32.12463    11
## 12            UA 32.48067    12
## 13            MQ 38.75135    13
## 14            EV 40.24231    14
## 15            B6 45.47744    15
# How many airplanes only flew to one destination?
hflights %>%
  group_by(TailNum) %>%
  summarise(destPerTail = n_distinct(Dest)) %>%
  filter(destPerTail == 1) %>%
  summarise(nplanes=n())
## Source: local data frame [1 x 1]
## 
##   nplanes
##     (int)
## 1    1526
# Find the most visited destination for each carrier
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(-n)) %>%
  filter(rank == 1)
## Source: local data frame [15 x 4]
## Groups: UniqueCarrier [15]
## 
##    UniqueCarrier  Dest     n  rank
##            (chr) (chr) (int) (dbl)
## 1             AA   DFW  2105     1
## 2             AS   SEA   365     1
## 3             B6   JFK   695     1
## 4             CO   EWR  3924     1
## 5             DL   ATL  2396     1
## 6             EV   DTW   851     1
## 7             F9   DEN   837     1
## 8             FL   ATL  2029     1
## 9             MQ   DFW  2424     1
## 10            OO   COS  1335     1
## 11            UA   SFO   643     1
## 12            US   CLT  2212     1
## 13            WN   DAL  8243     1
## 14            XE   CRP  3175     1
## 15            YV   CLT    71     1
# Use summarise to calculate n_carrier
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, last
## The following object is masked from 'package:purrr':
## 
##     transpose
hflights2 <- as.data.table(hflights)
hflights2 %>%
    summarize(n_carrier = n_distinct(UniqueCarrier))
##    n_carrier
## 1:        15

And, dplyr can be used with databases, including writing the SQL query that matches to the dplyr request. The results are cached to avoid constantly pinging the server:

# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr", 
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "student",
                   password = "datacamp")

# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
glimpse(nycflights)
## Observations: 336,776
## Variables: 17
## $ id        (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ year      (int) 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day       (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time  (int) 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ dep_delay (int) 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time  (int) 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ arr_delay (int) 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier   (chr) "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ tailnum   (chr) "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ flight    (int) 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ origin    (chr) "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest      (chr) "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time  (int) 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance  (int) 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour      (int) 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
## $ minute    (int) 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...
# Ordered, grouped summary of nycflights
nycflights %>%
    group_by(carrier) %>%
    summarize(n_flights = n(), avg_delay = mean(arr_delay)) %>%
    arrange(avg_delay)
## Source: mysql 5.6.23-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## From: <derived table> [?? x 3]
## Arrange: avg_delay
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
##    carrier n_flights avg_delay
##      (chr)     (dbl)     (dbl)
## 1       AS       714   -9.8613
## 2       HA       342   -6.9152
## 3       AA     32729    0.3556
## 4       DL     48110    1.6289
## 5       VX      5162    1.7487
## 6       US     20536    2.0565
## 7       UA     58665    3.5045
## 8       9E     18460    6.9135
## 9       B6     54635    9.3565
## 10      WN     12275    9.4675
## ..     ...       ...       ...

Data Manipulation (data.table)

The data.table library is designed to simplify and speed up work with large datasets. The language is broadly analogous to SQL, with syntax that includes equivalents for SELECT, WHERE, and GROUP BY. Some general attributes of a data.table object include:

  • Set of columns; every column is the same length but may be of different type
  • Goal #1: Reduce programming time (fewer function calls, less variable name repetition)
  • Goal #2: Reduce compute time (fast aggregation, update by reference
  • Currently in-memory (64-bit and 100 GB is routine; one-quarter-terabyte RAM is available through Amazon EC2 for a few dollars per hours)
  • Ordered joins (useful in finance/time series and also genomics)

NOTE - all data.table are also data.frame, and if a package is not aware of data.table, then it will act as data.frame for that package.

General syntax is:

  • myDataTable[condition, data/transforms, order by]
    • Extracts all rows that meet condition, provides the requested data/transforms, and orders accordingly
    • Analogous to SQL - WHERE, SELECT, GROUP BY
    • DT[i, j, by]

Example table creation:

  • DT <- data.table(A = 1:6, B=c(“a”, “b”, “c”), C=rnorm(6), D=TRUE)
    • “We like character vectors in data.table”
    • Need to use 1L for integer, and NA_integer_ for NA/integer (rather than boolean)
    • DT[3:5, ] is the same as DT[3:5] – either will return rows 3-5
    • Note that .N contains the number of rows
  • Select columns in data.table (second argument)
    • .(B, C) is the same as list(B, C) and will select the columns named “B” and “C”
    • .(mysum = sum(B)) will sum the entirety of column B for the rows requested and call the column mysum
    • .(B, C= sum(C)) will recycle sum(C) everywhere and also pull B
    • DT[,plot(A, C)] will plot A vs C
    • DT[ , B] will return a VECTOR and not a data.table
    • DT[ , .(B)] will return a data.table
  • Using a by variable allows for sum/mean/etc. by grouping:
    • DT[ , .(mysum = sum(B)), by=.(C)] will sum column B BY each C for the rows requested, and call the column mysum
    • DT[ , .(mysum = sum(B)), by=.(myMod = C%%2)] will sum column B BY each Cmod2 for the rows requested, and call the column mysum
    • Can skip the .() if you have just a single SELECT or a single GROUP BY
      • Order depends on what it finds first – not necessarily sorted, just aggregated BY

Some example code includes:

library(data.table)

DT <- data.table(a = c(1, 2), b=LETTERS[1:4])
str(DT)
## Classes 'data.table' and 'data.frame':   4 obs. of  2 variables:
##  $ a: num  1 2 1 2
##  $ b: chr  "A" "B" "C" "D"
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    a b
## 1: 1 A
## 2: 2 B
## 3: 1 C
## 4: 2 D
# Print the second to last row of DT using .N
DT[.N-1]
##    a b
## 1: 1 C
# Print the column names of DT
names(DT)
## [1] "a" "b"
# Print the number or rows and columns of DT
dim(DT)
## [1] 4 2
# Select row 2 twice and row 3, returning a data.table with three rows where row 2 is a duplicate of row 1.
DT[c(2, 2:3)]
##    a b
## 1: 2 B
## 2: 2 B
## 3: 1 C
DT <- data.table(A = 1:5, B = letters[1:5], C = 6:10)
str(DT)
## Classes 'data.table' and 'data.frame':   5 obs. of  3 variables:
##  $ A: int  1 2 3 4 5
##  $ B: chr  "a" "b" "c" "d" ...
##  $ C: int  6 7 8 9 10
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B  C
## 1: 1 a  6
## 2: 2 b  7
## 3: 3 c  8
## 4: 4 d  9
## 5: 5 e 10
# Subset rows 1 and 3, and columns B and C
DT[c(1, 3), .(B, C)]
##    B C
## 1: a 6
## 2: c 8
# Assign to ans the correct value
ans <- DT[ , .(B, val=A*C)]
ans
##    B val
## 1: a   6
## 2: b  14
## 3: c  24
## 4: d  36
## 5: e  50
# Fill in the blanks such that ans2 equals target
target <- data.table(B = c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e"), 
                     val = as.integer(c(6:10, 1:5))
                     )
ans2 <- DT[, .(B, val = c(C, A))]
identical(target, ans2)
## [1] TRUE
DT <- as.data.table(iris)
str(DT)
## Classes 'data.table' and 'data.frame':   150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# For each Species, print the mean Sepal.Length
DT[ , mean(Sepal.Length), Species]
##       Species    V1
## 1:     setosa 5.006
## 2: versicolor 5.936
## 3:  virginica 6.588
# Print mean Sepal.Length, grouping by first letter of Species
DT[ , mean(Sepal.Length), substr(Species, 1, 1)]
##    substr    V1
## 1:      s 5.006
## 2:      v 6.262
str(DT)
## Classes 'data.table' and 'data.frame':   150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
identical(DT, as.data.table(iris))
## [1] TRUE
# Group the specimens by Sepal area (to the nearest 10 cm2) and count how many occur in each group.
DT[, .N, by = 10 * round(Sepal.Length * Sepal.Width / 10)]
##    round   N
## 1:    20 117
## 2:    10  29
## 3:    30   4
# Now name the output columns `Area` and `Count`
DT[, .(Count=.N), by = .(Area = 10 * round(Sepal.Length * Sepal.Width / 10))]
##    Area Count
## 1:   20   117
## 2:   10    29
## 3:   30     4
# Create the data.table DT
set.seed(1L)
DT <- data.table(A = rep(letters[2:1], each = 4L), 
                 B = rep(1:4, each = 2L), 
                 C = sample(8)
                 )
str(DT)
## Classes 'data.table' and 'data.frame':   8 obs. of  3 variables:
##  $ A: chr  "b" "b" "b" "b" ...
##  $ B: int  1 1 2 2 3 3 4 4
##  $ C: int  3 8 4 5 1 7 2 6
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B C
## 1: b 1 3
## 2: b 1 8
## 3: b 2 4
## 4: b 2 5
## 5: a 3 1
## 6: a 3 7
## 7: a 4 2
## 8: a 4 6
# Create the new data.table, DT2
DT2 <- DT[, .(C = cumsum(C)), by = .(A, B)]
str(DT2)
## Classes 'data.table' and 'data.frame':   8 obs. of  3 variables:
##  $ A: chr  "b" "b" "b" "b" ...
##  $ B: int  1 1 2 2 3 3 4 4
##  $ C: int  3 11 4 9 1 8 2 8
##  - attr(*, ".internal.selfref")=<externalptr>
DT2
##    A B  C
## 1: b 1  3
## 2: b 1 11
## 3: b 2  4
## 4: b 2  9
## 5: a 3  1
## 6: a 3  8
## 7: a 4  2
## 8: a 4  8
# Select from DT2 the last two values from C while you group by A
DT2[, .(C = tail(C, 2)), by = A]
##    A C
## 1: b 4
## 2: b 9
## 3: a 2
## 4: a 8

The chaining operation in data.table is run as [statement][next statement].

  • The .SD means “Subset of Data”
    • By default, .SD means all of the columns other than the columns specified in by (and only accessible in j)
    • DT[ , lapply(.SD, median), by = Species]
    • Recall that .() is just an alias to a list, so it is not needed for lapply (which always returns a list anyway)
  • The := operator is for adding by reference
    • If it already exists, it is updated as per the call
    • If it does not already exist, it is created
    • DT[ , c(“x”, “z”) := .(rev(x), 10:6)] # will reverse x and create z as 10-9-8-7-6]
    • Anything with := NULL will remove the columns instantly
    • DT[ , MyCols :=NULL] will look for a column called MyCols
    • DT[, (MyCols) := NULL] will use whatever MyCols references, allowing for MyCols to be a variable
    • DT[2:4, z:=sum(y), by=x] # Will create z as requested for rows 2:4 and create z=NA everywhere else; interesting (and risky perhaps .)
  • The set() syntax is another option:
    • for (i in 1:5) DT[i, z := i+1]
    • for (i in 1:5) set(DT, i, 3L, i+1]) # take DT, act on column 3 (happens to be z in this example) and makes it i+1
  • The setnames() syntax is yet another option
    • setnames(DT, “old”, “new”)
  • The setcolorder() syntax is yet another option
    • setcolorder(DT, c(new_order))
  • A wrap up of the set() family:
    • set() is a loopable, low overhead version of :=
    • You can use setnames() to set or change column names
    • setcolorder() lets you reorder the columns of a data.table

Example code includes:

set.seed(1L)
DT <- data.table(A = rep(letters[2:1], each = 4L), 
                 B = rep(1:4, each = 2L), 
                 C = sample(8)) 
str(DT)
## Classes 'data.table' and 'data.frame':   8 obs. of  3 variables:
##  $ A: chr  "b" "b" "b" "b" ...
##  $ B: int  1 1 2 2 3 3 4 4
##  $ C: int  3 8 4 5 1 7 2 6
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B C
## 1: b 1 3
## 2: b 1 8
## 3: b 2 4
## 4: b 2 5
## 5: a 3 1
## 6: a 3 7
## 7: a 4 2
## 8: a 4 6
# Perform operation using chaining
DT[ , .(C = cumsum(C)), by = .(A, B)][ , .(C = tail(C, 2)), by=.(A)]
##    A C
## 1: b 4
## 2: b 9
## 3: a 2
## 4: a 8
data(iris)
DT <- as.data.table(iris)
str(DT)
## Classes 'data.table' and 'data.frame':   150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Perform chained operations on DT
DT[ , .(Sepal.Length = median(Sepal.Length), Sepal.Width = median(Sepal.Width), 
        Petal.Length = median(Petal.Length), Petal.Width = median(Petal.Width)), 
        by=.(Species)][order(-Species)]
##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:  virginica          6.5         3.0         5.55         2.0
## 2: versicolor          5.9         2.8         4.35         1.3
## 3:     setosa          5.0         3.4         1.50         0.2
# Mean of columns
# DT[ , lapply(.SD, FUN=mean), by=.(x)]

# Median of columns
# DT[ , lapply(.SD, FUN=median), by=.(x)]

# Calculate the sum of the Q columns
# DT[ , lapply(.SD, FUN=sum), , .SDcols=2:4]

# Calculate the sum of columns H1 and H2 
# DT[ , lapply(.SD, FUN=sum), , .SDcols=paste0("H", 1:2)]

# Select all but the first row of groups 1 and 2, returning only the grp column and the Q columns
# foo = function(x) { x[-1] }
# DT[ , lapply(.SD, FUN=foo), by=.(grp), .SDcols=paste0("Q", 1:3)]

# Sum of all columns and the number of rows
# DT[, c(lapply(.SD, FUN=sum), .N), by=.(x), .SDcols=names(DT)]

# Cumulative sum of column x and y while grouping by x and z > 8
# DT[, lapply(.SD, FUN=cumsum), by=.(by1=x, by2=(z>8)), .SDcols=c("x", "y")]

# Chaining
# DT[, lapply(.SD, FUN=cumsum), by=.(by1=x, by2=(z>8)), .SDcols=c("x", "y")][ , lapply(.SD, FUN=max), by=.(by1), .SDcols=c("x", "y")]


# The data.table DT
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)], B = 1:5)
str(DT)
## Classes 'data.table' and 'data.frame':   5 obs. of  2 variables:
##  $ A: chr  "a" "a" "a" "b" ...
##  $ B: int  1 2 3 4 5
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B
## 1: a 1
## 2: a 2
## 3: a 3
## 4: b 4
## 5: b 5
# Add column by reference: Total
DT[ , Total:=sum(B), by=.(A)]
DT
##    A B Total
## 1: a 1     6
## 2: a 2     6
## 3: a 3     6
## 4: b 4     9
## 5: b 5     9
# Add 1 to column B
DT[c(2,4) , B:=B+1L, ]
DT
##    A B Total
## 1: a 1     6
## 2: a 3     6
## 3: a 3     6
## 4: b 5     9
## 5: b 5     9
# Add a new column Total2
DT[2:4, Total2:=sum(B), by=.(A)]
DT
##    A B Total Total2
## 1: a 1     6     NA
## 2: a 3     6      6
## 3: a 3     6      6
## 4: b 5     9      5
## 5: b 5     9     NA
# Remove the Total column
DT[ , Total := NULL, ]
DT
##    A B Total2
## 1: a 1     NA
## 2: a 3      6
## 3: a 3      6
## 4: b 5      5
## 5: b 5     NA
# Select the third column using `[[`
DT[[3]]
## [1] NA  6  6  5 NA
# A data.table DT has been created for you
DT <- data.table(A = c(1, 1, 1, 2, 2), B = 1:5)
str(DT)
## Classes 'data.table' and 'data.frame':   5 obs. of  2 variables:
##  $ A: num  1 1 1 2 2
##  $ B: int  1 2 3 4 5
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B
## 1: 1 1
## 2: 1 2
## 3: 1 3
## 4: 2 4
## 5: 2 5
# Update B, add C and D
DT[ , c("B", "C", "D") := .(B + 1, A + B, 2), ]
DT
##    A B C D
## 1: 1 2 2 2
## 2: 1 3 3 2
## 3: 1 4 4 2
## 4: 2 5 6 2
## 5: 2 6 7 2
# Delete my_cols
my_cols <- c("B", "C")
DT[ , (my_cols) := NULL, ]
DT
##    A D
## 1: 1 2
## 2: 1 2
## 3: 1 2
## 4: 2 2
## 5: 2 2
# Delete column 2 by number
DT[[2]] <- NULL
DT
##    A
## 1: 1
## 2: 1
## 3: 1
## 4: 2
## 5: 2
# Set the seed
# set.seed(1)

# Check the DT that is made available to you
# DT

# For loop with set
# for(i in 2:4) { set(DT, sample(nrow(DT), 3), i, NA) }

# Change the column names to lowercase
# setnames(DT, letters[1:4])

# Print the resulting DT to the console
# DT

# Define DT
DT <- data.table(a = letters[c(1, 1, 1, 2, 2)], b = 1)
str(DT)
## Classes 'data.table' and 'data.frame':   5 obs. of  2 variables:
##  $ a: chr  "a" "a" "a" "b" ...
##  $ b: num  1 1 1 1 1
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    a b
## 1: a 1
## 2: a 1
## 3: a 1
## 4: b 1
## 5: b 1
# Add a suffix "_2" to all column names
setnames(DT, paste0(names(DT), "_2"))
DT
##    a_2 b_2
## 1:   a   1
## 2:   a   1
## 3:   a   1
## 4:   b   1
## 5:   b   1
# Change column name "a_2" to "A2"
setnames(DT, "a_2", "A2")
DT
##    A2 b_2
## 1:  a   1
## 2:  a   1
## 3:  a   1
## 4:  b   1
## 5:  b   1
# Reverse the order of the columns
setcolorder(DT, 2:1)
DT
##    b_2 A2
## 1:   1  a
## 2:   1  a
## 3:   1  a
## 4:   1  b
## 5:   1  b
  • Section 8 - Indexing (using column names in i)
    • DT[A == “a”] # returns only the rows where column A has value “a”
    • w <- DT[, A == “a”] # creates a new variable w that is the boolean evaluation of A == “a”
      • Note that this is a vector and not a list since it is not wrapped in .()
    • DT[w] will return the same thing as DT[A == “a”]
    • The data.table() package automatically creates an index the first time you use the variable
      • DT[A == “a”] # takes however long it needs
      • DT[A == “b”] # now runs very fast since it is indexed
  • Section 9 - creating and using a key
    • setkey(DT, varname)
    • DT[“b”] # will find where varname that has been set as key is equal to “b”
    • DT[“b”, mult=“first”] # will return only the first match
    • DT[“b”, mult=“last”] # will return only the last match
    • If one of the requested keys is not found, a row with NA is returned
      • DT[c(“b”, “d”)] could return an NA
      • DT[c(“b”, “d”), nomatch = 0] will never return an NA; instead it will just skip the rows
    • If you create setkey(DT, A, B) then it will be indexed by both A and B
      • DT[.(“b”, 5)] # this will pull rows where A == “b” and B == 5
  • Section 10 - Rolling joins (typically used for time series)
    • DT[.(“b”, 4), roll=TRUE] # If there is a “b”, 4 then it will find it; if not, then it will find the closest previous match
    • DT[.(“b”, 4), roll=“nearest”] # If there is a “b”, 4 then it will find it; if not, then it will find the nearest match
    • DT[.(“b”, 4), roll=+Inf] # If there is a “b”, 4 then it will find it; if not, then it will find the closest previous match
    • DT[.(“b”, 4), roll=-Inf] # If there is a “b”, 4 then it will find it; if not, then it will find the closest succeeding match
    • DT[.(“b”, 4), roll=2] # If there is a “b”, 4 then it will find it; if not, then it will find the closest previous match provided it was within 2
    • DT[.(“b”, 4), roll=-2] # If there is a “b”, 4 then it will find it; if not, then it will find the closest succeeding match provided it was within 2
    • DT[.(“b”, 4), roll=TRUE, rollends=FALSE] # If there is a “b”, 4 then it will find it; if not, then it will find the closest previous match, except it will not go beyond the data

Example code includes:

# iris as a data.table
iris <- as.data.table(iris)

# Remove the "Sepal." prefix
names(iris) <- gsub("Sepal\\.", "", names(iris))

# Remove the two columns starting with "Petal"
iris[, c("Petal.Length", "Petal.Width") := NULL, ]

# Cleaned up iris data.table
str(iris)
## Classes 'data.table' and 'data.frame':   150 obs. of  3 variables:
##  $ Length : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Width  : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Area is greater than 20 square centimeters
iris[ Width * Length > 20 ]
##     Length Width    Species
##  1:    5.4   3.9     setosa
##  2:    5.8   4.0     setosa
##  3:    5.7   4.4     setosa
##  4:    5.4   3.9     setosa
##  5:    5.7   3.8     setosa
##  6:    5.2   4.1     setosa
##  7:    5.5   4.2     setosa
##  8:    7.0   3.2 versicolor
##  9:    6.4   3.2 versicolor
## 10:    6.9   3.1 versicolor
## 11:    6.3   3.3 versicolor
## 12:    6.7   3.1 versicolor
## 13:    6.7   3.0 versicolor
## 14:    6.0   3.4 versicolor
## 15:    6.7   3.1 versicolor
## 16:    6.3   3.3  virginica
## 17:    7.1   3.0  virginica
## 18:    7.6   3.0  virginica
## 19:    7.3   2.9  virginica
## 20:    7.2   3.6  virginica
## 21:    6.5   3.2  virginica
## 22:    6.8   3.0  virginica
## 23:    6.4   3.2  virginica
## 24:    7.7   3.8  virginica
## 25:    7.7   2.6  virginica
## 26:    6.9   3.2  virginica
## 27:    7.7   2.8  virginica
## 28:    6.7   3.3  virginica
## 29:    7.2   3.2  virginica
## 30:    7.2   3.0  virginica
## 31:    7.4   2.8  virginica
## 32:    7.9   3.8  virginica
## 33:    7.7   3.0  virginica
## 34:    6.3   3.4  virginica
## 35:    6.9   3.1  virginica
## 36:    6.7   3.1  virginica
## 37:    6.9   3.1  virginica
## 38:    6.8   3.2  virginica
## 39:    6.7   3.3  virginica
## 40:    6.7   3.0  virginica
## 41:    6.2   3.4  virginica
##     Length Width    Species
# Add new boolean column
iris[, is_large := Width * Length > 25]
## Warning in `[.data.table`(iris, , `:=`(is_large, Width * Length > 25)):
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy
## of the data.table so that := can add this new column by reference. At
## an earlier point, this data.table has been copied by R (or been created
## manually using structure() or similar). Avoid key<-, names<- and attr<-
## which in R currently (and oddly) may copy the whole data.table. Use set*
## syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in
## R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to
## copy named objects); please upgrade to R>v3.0.2 if that is biting. If this
## message doesn't help, please report to datatable-help so the root cause can
## be fixed.
# Now large observations with is_large
iris[is_large == TRUE]
##    Length Width   Species is_large
## 1:    5.7   4.4    setosa     TRUE
## 2:    7.2   3.6 virginica     TRUE
## 3:    7.7   3.8 virginica     TRUE
## 4:    7.9   3.8 virginica     TRUE
iris[(is_large)] # Also OK
##    Length Width   Species is_large
## 1:    5.7   4.4    setosa     TRUE
## 2:    7.2   3.6 virginica     TRUE
## 3:    7.7   3.8 virginica     TRUE
## 4:    7.9   3.8 virginica     TRUE
# The 'keyed' data.table DT
DT <- data.table(A = letters[c(2, 1, 2, 3, 1, 2, 3)], 
                 B = c(5, 4, 1, 9, 8, 8, 6), 
                 C = 6:12)
setkey(DT, A, B)
str(DT)
## Classes 'data.table' and 'data.frame':   7 obs. of  3 variables:
##  $ A: chr  "a" "a" "b" "b" ...
##  $ B: num  4 8 1 5 8 6 9
##  $ C: int  7 10 8 6 11 12 9
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr  "A" "B"
DT
##    A B  C
## 1: a 4  7
## 2: a 8 10
## 3: b 1  8
## 4: b 5  6
## 5: b 8 11
## 6: c 6 12
## 7: c 9  9
# Select the "b" group
DT["b"]
##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11
# "b" and "c" groups
DT[c("b", "c")]
##    A B  C
## 1: b 1  8
## 2: b 5  6
## 3: b 8 11
## 4: c 6 12
## 5: c 9  9
# The first row of the "b" and "c" groups
DT[c("b", "c"), mult = "first"]
##    A B  C
## 1: b 1  8
## 2: c 6 12
# First and last row of the "b" and "c" groups
DT[c("b", "c"), .SD[c(1, .N)], by = .EACHI]
##    A B  C
## 1: b 1  8
## 2: b 8 11
## 3: c 6 12
## 4: c 9  9
# Copy and extend code for instruction 4: add printout
DT[c("b", "c"), { print(.SD); .SD[c(1, .N)] }, by = .EACHI]
##    B  C
## 1: 1  8
## 2: 5  6
## 3: 8 11
##    B  C
## 1: 6 12
## 2: 9  9
##    A B  C
## 1: b 1  8
## 2: b 8 11
## 3: c 6 12
## 4: c 9  9
# Keyed data.table DT
DT <- data.table(A = letters[c(2, 1, 2, 3, 1, 2, 3)], 
                 B = c(5, 4, 1, 9, 8, 8, 6), 
                 C = 6:12, 
                 key = "A,B")
str(DT)
## Classes 'data.table' and 'data.frame':   7 obs. of  3 variables:
##  $ A: chr  "a" "a" "b" "b" ...
##  $ B: num  4 8 1 5 8 6 9
##  $ C: int  7 10 8 6 11 12 9
##  - attr(*, "sorted")= chr  "A" "B"
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B  C
## 1: a 4  7
## 2: a 8 10
## 3: b 1  8
## 4: b 5  6
## 5: b 8 11
## 6: c 6 12
## 7: c 9  9
# Get the key of DT
key(DT)
## [1] "A" "B"
# Row where A == "b" and B == 6
DT[.("b", 6)]
##    A B  C
## 1: b 6 NA
# Return the prevailing row
DT[.("b", 6), roll=TRUE]
##    A B C
## 1: b 6 6
# Return the nearest row
DT[.("b", 6), roll="nearest"]
##    A B C
## 1: b 6 6
# Keyed data.table DT
DT <- data.table(A = letters[c(2, 1, 2, 3, 1, 2, 3)], 
                 B = c(5, 4, 1, 9, 8, 8, 6), 
                 C = 6:12, 
                 key = "A,B")
str(DT)
## Classes 'data.table' and 'data.frame':   7 obs. of  3 variables:
##  $ A: chr  "a" "a" "b" "b" ...
##  $ B: num  4 8 1 5 8 6 9
##  $ C: int  7 10 8 6 11 12 9
##  - attr(*, "sorted")= chr  "A" "B"
##  - attr(*, ".internal.selfref")=<externalptr>
DT
##    A B  C
## 1: a 4  7
## 2: a 8 10
## 3: b 1  8
## 4: b 5  6
## 5: b 8 11
## 6: c 6 12
## 7: c 9  9
# Print the sequence (-2):10 for the "b" group
DT[.("b", (-2):10)]
##     A  B  C
##  1: b -2 NA
##  2: b -1 NA
##  3: b  0 NA
##  4: b  1  8
##  5: b  2 NA
##  6: b  3 NA
##  7: b  4 NA
##  8: b  5  6
##  9: b  6 NA
## 10: b  7 NA
## 11: b  8 11
## 12: b  9 NA
## 13: b 10 NA
# Add code: carry the prevailing values forwards
DT[.("b", (-2):10), roll=TRUE]
##     A  B  C
##  1: b -2 NA
##  2: b -1 NA
##  3: b  0 NA
##  4: b  1  8
##  5: b  2  8
##  6: b  3  8
##  7: b  4  8
##  8: b  5  6
##  9: b  6  6
## 10: b  7  6
## 11: b  8 11
## 12: b  9 11
## 13: b 10 11
# Add code: carry the first observation backwards
DT[.("b", (-2):10), roll=TRUE, rollends=TRUE]
##     A  B  C
##  1: b -2  8
##  2: b -1  8
##  3: b  0  8
##  4: b  1  8
##  5: b  2  8
##  6: b  3  8
##  7: b  4  8
##  8: b  5  6
##  9: b  6  6
## 10: b  7  6
## 11: b  8 11
## 12: b  9 11
## 13: b 10 11

Data Manipulation (xts and zoo)

Jeff Ryan, the creator of quantmod and organizer of the R/Finance conference, has developed xts and zoo to simplify working with time series data. The course will cover five areas (chapters):

  • Chapter 1: Create, import, and export time series
  • Chapter 2: Subset, extract, and more
  • Chapter 3: Merge and modify time series
  • Chapter 4: Apply and aggregate by time
  • Chapter 5: Advanced and extra features of xts

“xts” stands for extensible time series. The core of each “xts” is a “zoo” object, consisting of a matrix plus an index.

  • Basically, it is data plus an array of times
    • x <- matrix(data=1:4, ncol=2)
    • idx <- as.Date(c(“2015-01-01”, “2015-02-01”))
      • The idx needs to be “time based”, though the type of time object can be flexible - Date, POSIX times, timeData, chron, . . .
      • The index should be in increasing order of time (earlier at the type)
  • The xts functions allow for joining the index and the data
    • X <- xts(x, order.by = idx) # Can add arguments unique=TRUE (force times to be unique) and tzone=“” to override the system time-zone
    • If the “idx” that you passed is not sorted ascending (earliest times first), the xts call will sort both the “x” and the “idx” such that the resulting xts object is ascending in time

There are a few special behaviors of xts:

  • The xts object is a matrix with associated times for each object
  • Subsets will preserve the matrix form (even if taking just a single row or a single column – no drop=TRUE default)
  • Attributes are (generally) preserved even when you subset
  • The “xts” object is a subset of “zoo”, meaning that it preserves all the power of the “zoo” capability

The “xts” object can be de-constructed when needed:

  • coredata(x, fmt=FALSE) brings back the matrix
  • index(x) brings back the index

Data usually already exists and needs to be “wrangled” in to a proper format for xts/zoo. The easiest way to convert is using as.xts(). You can coerce truly external data after loading it, and can also save data with Can also save with write.zoo(x, “file”).

Subsetting based on time is a particular strength of xts. xts supports ISO8601:2004 (the standard, “right way”, to unambiguously consider times):

  • Moving left-to-right for the most significant to least significant impact
  • YYYY-MM-DDTHH:MM:SS format
  • Specifying only the year (e.g., 2014) is fine, while specifying only the month (e.g., “02”) is not

xts allows for four methods of specifying dates or intervals:

  1. One and two-sided intervals (“2004” or “2001/2005”)
  2. Truncated representation (“201402/03”)
  3. Time support (“2014-02-22 08:30:00”)
  4. Repeating intervals (“T08:00/T09:00”)

Can also use some traditional R-like methods (since xts extends zoo, and zoo extends base R):

  • Integer indexing - x[c(1, 2, 3), ]
  • Logical vectors - x[index(x) > “2016-08-20”]
  • Date objects - x[index(as.POSIXct(c(“2016-06-25”, “2016-06-27”)))]

Can set the flag which.i = TRUE to get back the correct records (row numbers). For example, index <- x[“2007-06-26/2007-06-28”, which.i = TRUE].

Description of key behaviors when working with an xts object:

  • All subsets will preserve the matrix (drop=FALSE)
  • Order is always preserved - cannot intentionally or uninetntionally reorder the data - requesting c(1, 2) or c(2, 1) returns the same thing
  • Binary search and memcpy are leveraged, meaning that it works faster than base R
  • Index and object attributes are always preserved

xts introduces a few relatives of the head() and tail() functionality. These are the first() and last() functions.

  • first(edhec[, “Funds of Funds”], “4 months”)
  • last(edhec[, “Funds of Funds”], “1 year”)
  • Can uses a negative to mean “except”, such as “-4 months”
  • The first() and last() can be nested within one another

Math operations using xts - xts is a matrix - need to be careful about matrix operations. Math operations are run only on the intersection of items:

  • Only the intersecting observations will be (for example) added together – others are dropped!
  • Sometimes it may be necessary to drop the xts class – drop=TRUE, coredata(), as.numeric(), etc.
  • Special handling (described in the next chapter) may be needed when you want the union of dates

Merging time series is common. Merge (cbind, merge) combines by columns, but joining based on index.

  • Syntax is merge (
  • fill is available to allow missing values to be coerced as needed
  • If you merge(x, as.Date(c(“2016-08-14”))) then you will have a new date (2016-08-14) in your database

Merge (rbind( combine by rows, though all rows must already have an index. Basically, the rbind MUST be used on a time series.

Missing data is common, and xts inherits all of the zoo methods for dealing with missing data. The locf is the “last observation carry forward” (latest value that is not NA) - called with na.locf:

  • na.locf(object, na.rm=TRUE, fromLast = FALSE, maxgap = Inf)
  • Generic function for replacing each NA with the most recent non-NA prior to it.

The NA can be managed in several ways:

  • na.fill(object, fill, . ) # fill the NA with the fill value
  • na.trim(object, . ) # remove NA that are at the beginning or end
  • na.omit(object, . ) # remove all NA
  • na.approx(object, . ) # interpolate NA based on distance from object

Lag operators and difference operations. Seasonality is a repeating pattern. There is often a need to compare seasonality – for example, compare Mondays. Stationarity refers to some bound of the series.

The lag() function will change the timestamp, so that (for example) today can be merged as last week:

  • lag(x, k=1, na.pad=TRUE, . ) # positive k will shift the values FORWARD
  • Note that base R and zoo are the opposite, where lag(k=) means move forward
  • This is not what the literature recommends, and zoo follows the literature, with k= shifting time forward

The “one period lag first difference” is calculated as diff(x, lag=1, differences=1, arithmetic=TRUE, log=FALSE, na.pad=TRUE, . ).

There are two main approaches for applying functions on discrete periods or intervals:

  • period.apply(x, INDEX, FUN, . )
    • INDEX should be a vector of end-points of a period
    • The end-point will be the last observation per interval
      • endpoints(x, on=“years”) will create an endpoint vector by year ## can be “days” or “seconds” or the like; always starts with 0
    • data(PerformanceAnalytics::edhec); edhec_4yr <- edhec[“1997/2001”]; ep <- endpoints(edhec_4yr, “years”); period.apply(edhec_4yr, INDEX=ep, FUN=mean)
    • There are shortcut functions like apply.yearly() which take care of all the indexing and endpoints automatically
  • split(x, f=“months”)
    • This will split the data by month
    • Outcome would be a list by months

Time series aggregation can also be handled by xts:

  • Useful to convert a univariate series to range bars (OHLC = Open, High, Low, Close)
    • Provides a summary of a particular period - start, max, min, end
    • to.period(x, period=“months”, k=1, indexAt, name=NULL, OHLC=TRUE, . )
      • indexAt lets you adjust labelling of outputs (default is end of period), while name lets you define the roots used in the columns
    • to.period(edhec[“1997/2001”, 1], “years”, name=“EDHEC”)
    • to.period(edhec[“1997/2001”, 1], “years”, name=“EDHEC”, indexAt=“firstof”)
    • to.period(edhec[“1997/2001”, 1], “years”, name=“EDHEC”, OHLC=FALSE) # will pull the last observation only

Time series data can also be managed in a “rolling” manner - discrete or continuous:

  • Discrete rolling windows would be things like “month to date”
    • split() followed by lapply() using FUN=cumsum, cumprod, cummin, cummax
    • edhec.yrs <- split(edhec[, 1], f=“years”)
    • edhec.yrs <- lapply(edhec.yrs, FUN=cumsum)
    • edhec.ytd <- do.call(rbind, edhec.yrs)
  • Continuous rolling windows are managed through:
    • rollapply(data, width, FUN, . , by=1, by.column = TRUE, fill= if (na.pad) NA, na.pad=TRUE, partial=TRUE, align=c(“right”, “center”, “left”))

Internals of xts such as indices and timezones:

  • The index is always stored as fractional seconds since midnight 1970-01-01 UTC
  • xts will use tclass (attribute for extraction) - if you passed in a date, you get back a date – indexClass()
  • xts will use tzone (attribute for time zone) – indexTZ()
  • xts will use indexFormat (attribute for optional display preferences) – indexFormat() <-
  • Sys.setenv(TZ = “America/Chicago”)
    • help(OlsonNames)

Final topics:

  • Periodicity - identify underlying regularity in the data (what type of data do we have)
    • May be irregular data, so this is just an estimate – periodicity()
  • Counting – number of discrete periods (unique endpoints) – note that monthly data has the same answer for ndays() and nmonths()
    • Only makes sense to count periods if the data have HIGHER frequency than what you are trying to count
  • Broken down time can be extracted with .index
    • index(Z); .indexmday(Z) # month day; .indexyday(Z) # year day; .indexyear(Z) + 1900
  • Can align timing – align.time(x, n=60) # n is in seconds
    • make.index.unique(x, eps=1e-06, drop=FALSE, fromLast=FALSE, . ) will help to manage duplicates

Example code includes:

library(xts)
## Warning: package 'xts' was built under R version 3.2.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:data.table':
## 
##     last
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(zoo)

x <- matrix(data=1:4, ncol=2)
idx <- as.Date(c("2015-01-01", "2015-02-01"))

# Create the xts
X <- xts(x, order.by = idx)

# Decosntruct the xts
coredata(X, fmt=FALSE)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
index(X)
## [1] "2015-01-01" "2015-02-01"
# Working with the sunspots data
data(sunspots)
class(sunspots)
## [1] "ts"
sunspots_xts <- as.xts(sunspots)
class(sunspots_xts)
## [1] "xts" "zoo"
head(sunspots_xts)
##          [,1]
## Jan 1749 58.0
## Feb 1749 62.6
## Mar 1749 70.0
## Apr 1749 55.7
## May 1749 85.0
## Jun 1749 83.5
# Example from chapter #1
ex_matrix <- xts(matrix(data=c(1, 1, 1, 2, 2, 2), ncol=2), 
                 order.by=as.Date(c("2016-06-01", "2016-06-02", "2016-06-03"))
                 )
core <- coredata(ex_matrix)

# View the structure of ex_matrix
str(ex_matrix)
## An 'xts' object on 2016-06-01/2016-06-03 containing:
##   Data: num [1:3, 1:2] 1 1 1 2 2 2
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
##  NULL
# Extract the 3rd observation of the 2nd column of ex_matrix
ex_matrix[3, 2]
##            [,1]
## 2016-06-03    2
# Extract the 3rd observation of the 2nd column of core 
core[3, 2]
## [1] 2
# Create the object data using 5 random numbers
data <- rnorm(5)

# Create dates as a Date class object starting from 2016-01-01
dates <- seq(as.Date("2016-01-01"), length = 5, by = "days")

# Use xts() to create smith
smith <- xts(x = data, order.by = dates)

# Create bday (1899-05-08) using a POSIXct date class object
bday <- as.POSIXct("1899-05-08")

# Create hayek and add a new attribute called born
hayek <- xts(x = data, order.by = dates, born = bday)

# Extract the core data of hayek
hayek_core <- coredata(hayek)

# View the class of hayek_core
class(hayek_core)
## [1] "matrix"
# Extract the index of hayek
hayek_index <- index(hayek)

# View the class of hayek_index
class(hayek_index)
## [1] "Date"
# Create dates
dates <- as.Date("2016-01-01") + 0:4

# Create ts_a
ts_a <- xts(x = 1:5, order.by = dates)

# Create ts_b
ts_b <- xts(x = 1:5, order.by = as.POSIXct(dates))

# Extract the rows of ts_a using the index of ts_b
ts_a[index(ts_b)]
##            [,1]
## 2016-01-01    1
## 2016-01-02    2
## 2016-01-03    3
## 2016-01-04    4
## 2016-01-05    5
# Extract the rows of ts_b using the index of ts_a
ts_b[index(ts_a)]
##      [,1]
data(austres)

# Convert austres to an xts object called au
au <- as.xts(austres)

# Convert your xts object (au) into a matrix am
am <- as.matrix(au)

# Convert the original austres into a matrix am2
am2 <- as.matrix(austres)

# Create dat by reading tmp_file
tmp_file <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1127/datasets/tmp_file.csv"
dat <- read.csv(tmp_file)  

# Convert dat into xts
xts(dat, order.by = as.Date(rownames(dat), "%m/%d/%Y"))
##            a b
## 2015-01-02 1 3
## 2015-02-03 2 4
# Read tmp_file using read.zoo
dat_zoo <- read.zoo(tmp_file, index.column = 0, sep = ",", format = "%m/%d/%Y")

# Convert dat_zoo to xts
dat_xts <- as.xts(dat_zoo)

# Convert sunspots to xts using as.xts(). Save this as sunspots_xts
sunspots_xts <- as.xts(sunspots)

# Get the temporary file name
tmp <- tempfile()

# Write the xts object using zoo to tmp 
write.zoo(sunspots_xts, sep = ",", file = tmp)

# Read the tmp file. FUN = as.yearmon converts strings such as Jan 1749 into a proper time class
sun <- read.zoo(tmp, sep = ",", FUN = as.yearmon)

# Convert sun into xts. Save this as sun_xts
sun_xts <- as.xts(sun)



data(edhec, package="PerformanceAnalytics")

head(edhec["2007-01", 1])
##            Convertible Arbitrage
## 2007-01-31                 0.013
head(edhec["2007-01/2007-03", 1])
##            Convertible Arbitrage
## 2007-01-31                0.0130
## 2007-02-28                0.0117
## 2007-03-31                0.0060
head(edhec["200701/03", 1])
##            Convertible Arbitrage
## 2007-01-31                0.0130
## 2007-02-28                0.0117
## 2007-03-31                0.0060
first(edhec[, "Funds of Funds"], "4 months")
##            Funds of Funds
## 1997-01-31         0.0317
## 1997-02-28         0.0106
## 1997-03-31        -0.0077
## 1997-04-30         0.0009
last(edhec[, "Funds of Funds"], "1 year")
##            Funds of Funds
## 2009-01-31         0.0060
## 2009-02-28        -0.0037
## 2009-03-31         0.0008
## 2009-04-30         0.0092
## 2009-05-31         0.0312
## 2009-06-30         0.0024
## 2009-07-31         0.0153
## 2009-08-31         0.0113

Data Visualization (ggplot2 part 1)

Data visulaization is the combination of Statistics and Design:

  • Aids in communication and perception
  • Exploratory plots are designed for a very small audience (just yourself even) and can be very dense
  • Explanatory plots are designed to synthesize and communicate broadly - very labor intensive to create

The Anscombe plot examples show four different datasets explained by the identical linear model. This reinforces the importance of plotting the data prior to running analyses and drawing conclusions.

The “Grammar of Graphics” is a plotting framework based on the book by Leland Wilkinson, “Grammar of Graphics” 2(1999). The gist is that graphics are made of distinct layers of grammatical elements. Meaningful plots are created through aesthetic mapping.

Essential Grammatical Elements include:

  • MANDATORY (this course) - Data (dataset for plotting), Aesthetics (scales for mapping the data), Geometries (visual elements)
  • OPTIONAL (next course) - Facets (plotting small multiples), Statistics, Coordinates (space for plotting), Themes (non-data ink)

The ggplot2 package was one of the first developed and designed by Hadley Wickham. It implements the “Grammar of Graphics” in R, for example with:

  • Data: iris
  • Aesthetic: x=Sepal.Length, y=Sepal.Width
  • Geometries: geom_jitter(alpha = 0.6)
  • Facets: facet_grid(. ~ Species)
  • Statistics: stat_smooth(method = “lm”, se = F, col=“red”)
  • Coordinates/Themes: to be explored in a later course

The Anscombe data is good to have plotted for reference:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
data(anscombe)

ansX <- with(anscombe, c(x1, x2, x3, x4))
ansY <- with(anscombe, c(y1, y2, y3, y4))
ansType <- rep(1:4, each=nrow(anscombe))
ansFrame <- data.frame(x=ansX, y=ansY, series=factor(ansType))

# ggplot example for Anscombe data
ggplot(ansFrame, aes(x=x, y=y)) + 
    geom_point() + 
    geom_smooth(method="lm", col="red", se=FALSE, fullrange=TRUE) + 
    facet_wrap(~ series, nrow=2)

As well, the basic example code from above is useful to explore:

data(iris)

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + 
    geom_jitter(alpha = 0.6) + 
    facet_grid(. ~ Species) + 
    stat_smooth(method = "lm", se = FALSE, col="red")

Some additional basic ggplot syntax includes (cached, since plotting each point of the diamonds dataset is taxing for the graphics):

# Explore the mtcars data frame with str()
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Execute the following command
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

# A scatter plot has been made for you
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Replace ___ with the correct vector
ggplot(mtcars, aes(x = wt, y = mpg, col = disp)) +
  geom_point()

# Replace ___ with the correct vector
ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point()

# Explore the diamonds data frame with str()
data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Add geom_point() with +
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# Add geom_point() and geom_smooth() with +
ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()

# The plot you created in the previous exercise
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

# Copy the above command but show only the smooth line
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth()

# Copy the above command and assign the correct value to col in aes()
ggplot(diamonds, aes(x = carat, y = price, col=clarity)) +
  geom_smooth()

# Keep the color settings from previous command. Plot only the points with argument alpha.
ggplot(diamonds, aes(x = carat, y = price, col=clarity)) +
  geom_point(alpha = 0.4)

# Create the object containing the data and aes layers: dia_plot
dia_plot <- ggplot(diamonds, aes(x = carat, y=price))

# Add a geom layer with + and geom_point()
dia_plot + geom_point()

# Add the same geom layer, but with aes() inside
dia_plot + geom_point(aes(col = clarity))

set.seed(1)
# The dia_plot object has been created for you
dia_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Expand dia_plot by adding geom_point() with alpha set to 0.2
dia_plot <- dia_plot + geom_point(alpha = 0.2)

# Plot dia_plot with additional geom_smooth() with se set to FALSE
dia_plot + geom_smooth(se = FALSE)

# Copy the command from above and add aes() with the correct mapping to geom_smooth()
dia_plot + geom_smooth(se = FALSE, aes(col = clarity))

Data Layer - How data structure influences plots (ggplot2 vs. base):

  • Scatterplot in base: plot(iris\(Sepal.Length, iris\)Sepal.width); points(iris\(Petal.Length, iris\)Petal.Width, col=“red”)
  • Many limitations to the base approach, including
    • Plot does not get redrawn, meaning some of the Petal Length/Width end up “off plot”
    • Plot is drawn solely as an image; it is not an object that we can retain/manipulate
    • Need to manually add/adjust the legend, a potential source for errors
    • No unified framework for plotting (different functions for each plot type)
      Differences in ggplot
    • All plots are always called with ggplot(, aes()) + geom_(), plus additional layers as needed
    • The ggplot can be stored as an object using the assignment operator ( <- )
  • Note that the base package is generally fine for a very simple plot

Can add some additional points (similar to points() in base, but with axes rescaling for you):

  • ggplot() + geom_point() + geom_point(aes(), col=“red”)
    • Plotting space is adjusted
    • This is due to ggplot2 having created an object (which can thus be manipulated)
  • While the above works, it is really a misuse of the grammar of graphics - “please never do this!”
    • For example, there is no legend, and the axis labels are incorrect
  • In reality, the X axis should be “length” and the Y axis should be width
    • Suppose that you make iris.wide, a 300x4 consisting of Species, Part (Petal, Sepal), Length, Width) instead of the original 150x5
    • So now, color can be based on Part, with X = Length and Y = Width
  • When the data is made tidy, there are many advantages in the plotting
    • Tidy data for iris is format Species - Part (Sepal or Petal) - Measure (Length or Width) - Value

Example code includes:

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Plot the correct variables of mtcars
plot(mtcars$wt, mtcars$mpg, col=mtcars$cyl)

# Change cyl inside mtcars to a factor
mtcars$cyl <- as.factor(mtcars$cyl)

# Make the same plot as in the first instruction
plot(mtcars$wt, mtcars$mpg, col=mtcars$cyl)

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars)

# Call abline() with carModel as first argument and set lty to 2
abline(carModel, lty=2)

# Plot each subset efficiently with lapply
# You don't have to edit this code
lapply(mtcars$cyl, function(x) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
  })
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
## 
## [[11]]
## NULL
## 
## [[12]]
## NULL
## 
## [[13]]
## NULL
## 
## [[14]]
## NULL
## 
## [[15]]
## NULL
## 
## [[16]]
## NULL
## 
## [[17]]
## NULL
## 
## [[18]]
## NULL
## 
## [[19]]
## NULL
## 
## [[20]]
## NULL
## 
## [[21]]
## NULL
## 
## [[22]]
## NULL
## 
## [[23]]
## NULL
## 
## [[24]]
## NULL
## 
## [[25]]
## NULL
## 
## [[26]]
## NULL
## 
## [[27]]
## NULL
## 
## [[28]]
## NULL
## 
## [[29]]
## NULL
## 
## [[30]]
## NULL
## 
## [[31]]
## NULL
## 
## [[32]]
## NULL
# This code will draw the legend of the plot
# You don't have to edit this code
legend(x = 5, y = 33, legend = levels(mtcars$cyl), 
       col = 1:3, pch = 1, bty = "n")

# Plot 1: add geom_point() to this command to create a scatter plot
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()  # Fill in using instructions Plot 1

# Plot 2: include the lines of the linear models, per cyl
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() + # Copy from Plot 1
  geom_smooth(method="lm", se=FALSE)   # Fill in using instructions Plot 2

# Plot 3: include a lm for the entire dataset in its whole
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() + # Copy from Plot 2
  geom_smooth(method="lm", se=FALSE) + # Copy from Plot 2
  geom_smooth(aes(group = 1), method="lm", se=FALSE, linetype = 2)   # Fill in using instructions Plot 3

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Option 1
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_point(aes(x = Petal.Length, y = Petal.Width), col = "red")

# DS code to match up to lecturer data set formats
data(iris)
longIris <- tidyr::gather(iris, Type, Measure, -Species)
intIris <- tidyr::separate(longIris, Type, c("Part", "Metric"))
intIris$rowNum <- c(1:150, 1:150, 151:300, 151:300)
iris.wide <- tidyr::spread(intIris, Metric, Measure)
iris.tidy <- dplyr::select(dplyr::mutate(intIris, Value=Measure, Measure=Metric), Species, Part, Measure, Value)

# Option 2
ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) +
  geom_point()

# Consider the structure of iris, iris.wide and iris.tidy (in that order)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(iris.wide)
## 'data.frame':    300 obs. of  5 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Petal" "Petal" "Petal" "Petal" ...
##  $ rowNum : int  151 152 153 154 155 156 157 158 159 160 ...
##  $ Length : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Width  : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
str(iris.tidy)
## 'data.frame':    600 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Sepal" "Sepal" "Sepal" "Sepal" ...
##  $ Measure: chr  "Length" "Length" "Length" "Length" ...
##  $ Value  : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# Think about which dataset you would use to get the plot shown right
# Fill in the ___ to produce the plot given to the right
ggplot(iris.tidy, aes(x = Species, y = Value, col = Part)) +
  geom_jitter() +
  facet_grid(. ~ Measure)

# Load the tidyr package
library(tidyr)

# Fill in the ___ to produce to the correct iris.tidy dataset
iris.tidy <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.") 

# Consider the head of iris, iris.wide and iris.tidy (in that order)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
head(iris.wide)
##   Species  Part rowNum Length Width
## 1  setosa Petal    151    1.4   0.2
## 2  setosa Petal    152    1.4   0.2
## 3  setosa Petal    153    1.3   0.2
## 4  setosa Petal    154    1.5   0.2
## 5  setosa Petal    155    1.4   0.2
## 6  setosa Petal    156    1.7   0.4
head(iris.tidy)
##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4
# Think about which dataset you would use to get the plot shown right
# Fill in the ___ to produce the plot given to the right
ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

# Add column with unique ids (don't need to change)
iris$Flower <- 1:nrow(iris)

# Fill in the ___ to produce to the correct iris.wide dataset
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)

Visible aesthetics are the cornerstone of the ggplot:

  • Specific to ggplot, “how things look” is an attribute rather than an aesthetic
  • For ggplot, the aesthetic is just the mapping of the variables to the plots
  • For example, on a scatterplot, the mapping of variables to the x-axis and the y-axis is an aesthetic (as is the mapping of a variable to color)
    • It is always a data frame COLUMN that is mapped on to an aesthetic
  • Typical aesthetics include (n.b., many of these can function as EITHER aesthetic mappings OR attributes)
    • x - x-axis position
    • y - y-axis position
    • colour - colour of dots, outlines of other shapes
    • fill - fill colour
    • size - diameter of points, thickness of lines
    • alpha - transparency (0 = transparent, 1=opaque)
    • linetype - line dash pattern
    • labels - direct labels of an item (text on a plot or axes)
    • shape - shape
  • Common mistake is to use attribute rather than aesthetic and/or to overwrite the aesthetic with the attribute

Modifying aesthetics:

  • position “identity”" is the default for a scatterplot - data will be plotted directly as per the aesthetic mapping
    • geom_point(position = “identity”) is technically fine also, though not needed as it is the default
    • geom_point(position = “jitter”) is the same as geom_jitter()
    • Can also define the jitter, such as posn.j <- position_jitter(width = 0.1), followed later by geom_point(position = posn.j)
  • scale functions
    • scale_x
    • scale_y
    • scale_color
    • scale_fill
    • scale_shape
    • scale_linetype
  • limit functions (max and min for example)
  • breaks
  • expand (always a c(multiplicative, additive)), for creating gaps between the data and the axes
  • labels (to explicitly define the labels) . . . Which can also just use the labs() function

Best practices for choosing among the aesthetics (though note that “there is a fair bit of creativity involved”):

  • Jacques Bertin (“Seminology of Graphics”, 1967) and William Cleveland (“Perception of Visual Elements”, 1990s)
  • Form should always follow function (what is the point of the graph)
    • Beautiful plots is not the primary objective
    • Never misrepresent or obscure (do not create plots just to confuse the issue)
    • Always some information loss when translating between raw data and visual representations
  • Colors can be a confusing way to differentiate continuous variables (better for discrete variables)
  • Different axes can create misinterpretations also
  • Using shape without color can be very confusing also

Example code includes:

data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Map cyl to y
ggplot(mtcars, aes(x=mpg, y=cyl)) + geom_point()

# Map cyl to x
ggplot(mtcars, aes(y=mpg, x=cyl)) + geom_point()

# Map cyl to col
ggplot(mtcars, aes(y=mpg, x=wt, col=cyl)) + geom_point()

# Change shape and size of the points in the above plot
ggplot(mtcars, aes(y=mpg, x=wt, col=cyl)) + geom_point(shape=1, size=4)

# Map cyl to fill
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point()

# Change shape, size and alpha of the points in the above plot
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape=16, size=6, alpha=0.6)

# Map cyl to size
ggplot(mtcars, aes(y=mpg, x=wt, size=cyl)) + geom_point()
## Warning: Using size for a discrete variable is not advised.

# Map cyl to alpha
ggplot(mtcars, aes(y=mpg, x=wt, alpha=cyl)) + geom_point()

# Map cyl to shape 
ggplot(mtcars, aes(y=mpg, x=wt, shape=cyl)) + geom_point()

# Map cyl to labels
ggplot(mtcars, aes(y=mpg, x=wt, label=cyl)) + geom_text()

# Define a hexadecimal color
my_color <- "#123456"

# Set the color aesthetic 
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point()

# Set the color aesthetic and attribute 
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point(col = my_color)

# Set the fill aesthetic and color, size and shape attributes
ggplot(mtcars, aes(x=wt, y=mpg, fill=cyl)) + geom_point(size=10, shape=23, col=my_color)

# Expand to draw points with alpha 0.5
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) + geom_point(alpha=0.5)

# Expand to draw points with shape 24 and color yellow
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) + geom_point(shape=24, col="yellow")

# Expand to draw text with label x, color red and size 10
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) + geom_text(label="x", col="red", size=10)

# Map mpg onto x, qsec onto y and factor(cyl) onto col
ggplot(mtcars, aes(x=mpg, y=qsec, col=factor(cyl))) + geom_point()

# Add mapping: factor(am) onto shape
ggplot(mtcars, aes(x=mpg, y=qsec, col=factor(cyl), shape=factor(am))) + geom_point()

# Add mapping: (hp/wt) onto size
ggplot(mtcars, aes(x=mpg, y=qsec, col=factor(cyl), shape=factor(am), size=(hp/wt))) + geom_point()

# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point(size=4)

# Hollow circles - an improvement
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point(size=4, shape=1)

# Add transparency - very nice
ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point(size=4, alpha=0.6)

Next, bar plots are examined using the same data:

cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

# Add geom (position = "stack" by default)
cyl.am + geom_bar()

# Fill - show proportion
cyl.am + 
  geom_bar(position = "fill")  

# Dodging - principles of similarity and proximity
cyl.am +
  geom_bar(position = "dodge") 

# Clean up the axes with scale_ functions
val = c("#E41A1C", "#377EB8")
lab = c("Manual", "Automatic")
cyl.am +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") + 
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission", 
                    values = val,
                    labels = lab) 

# Add a new column called group
mtcars$group <- 0

# Create jittered plot of mtcars: mpg onto x, group onto y
ggplot(mtcars, aes(x = mpg, y=group)) + geom_jitter()

# Change the y aesthetic limits
ggplot(mtcars, aes(x = mpg, y=group)) + geom_jitter() + scale_y_continuous(limits = c(-2, 2))

Further, the diamonds data set is explored to show techniques for minimizing over-plotting problems. Per previous, it is cached due to the lengthy plot times driven by the many data points:

data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Scatter plot: carat (x), price (y), clarity (col)
ggplot(diamonds, aes(x=carat, y=price, col=clarity)) + geom_point()

# Adjust for overplotting
ggplot(diamonds, aes(x=carat, y=price, col=clarity)) + geom_point(alpha = 0.5)

# Scatter plot: clarity (x), carat (y), price (col)
ggplot(diamonds, aes(y=carat, x=clarity, col=price)) + geom_point(alpha = 0.5)

# Dot plot with jittering
ggplot(diamonds, aes(y=carat, x=clarity, col=price)) + geom_point(alpha = 0.5, position="jitter")

The geometries layer includes the most common plot types:

  • Scatter - point, jitter, abline
  • Bar - histogram, bar, errorbar
  • Line - line

Scatter plots examples - geom_point(), geom_jitter(), geom_abline():

  • geom_point() requires that x and y be aesthetics, and has optional aesthetic/attribute for alpha, colour, fill, shape, and size
  • Can specify the aesthetics inside the geom, meaning that the layered geoms may each have a different aesthetic if desired
  • In general, the geom_point() will INHERIT from the ggplot() call, although:
    • If anything is explicitly called inside geom_point() that will override what is in the ggplot() statement
    • Interestingly, can even have geom_point(data=other_data), and that will use other_data but still INHERIT whatever aesthetics for x and y were chosen previously
  • The shape argument is equivalent to the pch argument from base
    • Note that 21-25 have both fill and color which can be controlled independently
  • Note that geom_vline() and geom_hline() will make vertical and horizontal lines
  • Note that ggplot2 can often handle the statistics for you, and that mean without dispersion can tend to be misleading
  • Always optimize shape, size, and alpha blending

Bar plots examples - histogram, bar, errorbar:

  • When using geom_histogram, only x is required to be included as an aesthetic
    • This is actually binned, with a slightly different algorithm than hist() in base geom_histogram(binwidth=) # is excluded, will use range()/30
  • The labels are the intervals, while the y-axis is the counts
    • geom_histogram(aes(y = ..density.. ), binwidth=0.1) # The ..density.. Asks R to look for the internal data frame created by ggplot() as part of the histogram
  • For coloring, we need to use fill= since this is a shape to be filled (col would apply if these were points in a scatter-gram)
    • The col would only change the outside of the bars
  • The default is for the bars to be stacked based on value (position = “identity”)
    • position = “dodge” means one bar per category
    • position = “fill” is a stacked bar chart to 100%
  • The geom_bar is the superset that contains geom_histogram() as one of its components
    • If just geom_bar() is called, the result will look like geom_histogram(), though you can have a purely categorical variable as the aes(x=) component
    • The default argument for geom_bar() is stat=“bin”
    • If instead stat=“identity” is included, then the actual y value inside aes() will be plotted
  • The geom_errorbar() requires its own aes() call inside it – covering ymin, ymax, and width
  • The standard publication “dynamite plot” is strongly discouraged – rationale and alternatives to be explained in a future module

Line plots examples - line:

  • The simplest line graph is set up just like a scatterplot, but with calling geom_line() rather than geom_point()
  • Can add col= as part of the aesthetic, which will color the line by that variable
  • The geom_area() call combined with a fill= in the aes() will provide a stacked line chart
    • If position = “fill” is called inside geom_area(), then it will be a stacked bar, showing proportional trends over time
  • The geom_ribbon() call can put them all on the same scale, provided that ymin=0 is part of aes() inside geom_ribbon(), followed by alpha= so that they are not opaque overlaps

Example code from mtcars includes:

# mtcars point plots
# Plot the cyl on the x-axis and wt on the y-axis
ggplot(mtcars, aes(x=cyl, y=wt)) + geom_point()

# Use geom_jitter() instead of geom_point()
ggplot(mtcars, aes(x=cyl, y=wt)) + geom_jitter()

# Define the position object using position_jitter(): posn.j
posn.j <- position_jitter(width = 0.1)
  
# Use posn.j in geom_point()
ggplot(mtcars, aes(x=cyl, y=wt)) + geom_point(position = posn.j)

# mtcars bar plots
# Make a univariate histogram
ggplot(mtcars, aes(x=mpg)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Change the bin width to 1
ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth = 1)

# Change the y aesthetic to density
ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth = 1, aes(y=..density..))

# Custom color code
myBlue <- "#377EB8"

# Change the fill color to myBlue
ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth = 1, aes(y=..density..), fill=myBlue)

# Draw a bar plot of cyl, filled according to am
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar()

# Change the position argument to stack
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position="stack")

# Change the position argument to fill
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position="fill")

# Change the position argument to dodge
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position="dodge")

# Draw a bar plot of cyl, filled according to am
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar()

# Change the position argument to "dodge"
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position = "dodge")

# Define posn_d with position_dodge()
posn_d <- position_dodge(width=0.2)

# Change the position argument to posn_d
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position = posn_d)

# Use posn_d as position and adjust alpha to 0.6
ggplot(mtcars, aes(x=cyl, fill=am)) + geom_bar(position = posn_d, alpha=0.6)

# A basic histogram, add coloring defined by cyl 
ggplot(mtcars, aes(x=mpg, fill=cyl)) +
  geom_histogram(binwidth = 1)

# Change position to identity 
ggplot(mtcars, aes(x=mpg, fill=cyl)) +
  geom_histogram(binwidth = 1, position="identity")

# Change geom to freqpoly (position is identity by default) 
ggplot(mtcars, aes(x=mpg, col=cyl)) +
  geom_freqpoly(binwidth = 1, position="identity")

# Example of how to use a brewed color palette
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")

# Basic histogram plot command
ggplot(mtcars, aes(mpg)) + 
  geom_histogram(binwidth = 1)

# Expand the histogram to fill using am
ggplot(mtcars, aes(x=mpg, fill=am)) + 
  geom_histogram(binwidth = 1)

# Change the position argument to "dodge"
ggplot(mtcars, aes(x=mpg, fill=am)) + 
  geom_histogram(binwidth = 1, position="dodge")

# Change the position argument to "fill"
ggplot(mtcars, aes(x=mpg, fill=am)) + 
  geom_histogram(binwidth = 1, position="fill")
## Warning: Removed 16 rows containing missing values (geom_bar).

# Change the position argument to "identity" and set alpha to 0.4
ggplot(mtcars, aes(x=mpg, fill=am)) + 
  geom_histogram(binwidth = 1, position="identity", alpha = 0.4)

# Change fill to cyl
ggplot(mtcars, aes(x=mpg, fill=cyl)) + 
  geom_histogram(binwidth = 1, position="identity", alpha = 0.4)

Next, a few examples are run from dataset car::Vocab (cached due to plotting size/time):

Vocab <- car::Vocab
str(Vocab)
## 'data.frame':    21638 obs. of  4 variables:
##  $ year      : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 2 1 2 2 1 ...
##  $ education : int  9 14 14 17 14 14 12 10 11 9 ...
##  $ vocabulary: int  3 6 9 8 1 7 6 6 5 1 ...
# Basic scatter plot of vocabulary (y) against education (x). Use geom_point()
ggplot(Vocab, aes(x=education, y=vocabulary)) + geom_point()

# Use geom_jitter() instead of geom_point()
ggplot(Vocab, aes(x=education, y=vocabulary)) + geom_jitter()

# Using the above plotting command, set alpha to a very low 0.2
ggplot(Vocab, aes(x=education, y=vocabulary)) + geom_jitter(alpha = 0.2)

# Using the above plotting command, set the shape to 1
ggplot(Vocab, aes(x=education, y=vocabulary)) + geom_jitter(alpha = 0.2, shape=1)

# Plot education on x and vocabulary on fill
# Use the default brewed color palette
ggplot(Vocab, aes(x = education, fill = vocabulary)) +
  geom_bar(position="fill") +
  scale_fill_brewer()

# Definition of a set of blue colors
blues <- brewer.pal(9, "Blues")

# Make a color range using colorRampPalette() and the set of blues
blue_range <- colorRampPalette(blues)

# Use blue_range to adjust the color of the bars, use scale_fill_manual()
ggplot(Vocab, aes(x = education, fill = factor(vocabulary))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = blue_range(11))

Lastly, a few additional plots are displayed:

# Print out head of economics
data(economics)
head(economics)
## # A tibble: 6 × 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018
# Plot unemploy as a function of date using a line plot
ggplot(economics, aes(x = date, y = unemploy)) + geom_line()

# Adjust plot to represent the fraction of total population that is unemployed
ggplot(economics, aes(x = date, y = unemploy/pop)) + geom_line()

recess <- data.frame(begin=as.Date(c(-31, 1400, 3652, 4199, 7486, 11382), origin="1970-01-01"), end=as.Date(c(304, 1885, 3834, 4687, 7729, 11627), origin="1970-01-01"))

ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_line() + 
  geom_rect(data=recess, inherit.aes=FALSE, aes(xmin=begin, xmax=end, ymin=-Inf, ymax=+Inf), fill="red", alpha=0.2)

# Cannot find dataset . . . 
# Check the structure as a starting point
# str(fish.species)

# Use gather to go from fish.species to fish.tidy
# fish.tidy <- gather(fish.species, Species, Capture, -Year)

# Recreate the plot shown on the right
# ggplot(fish.tidy, aes(x = Year, y = Capture, col=Species)) + geom_line()

The qplot functionality is for making quick and dirty plots:

  • qplot(x, y, data=) will make a scatterplot
  • Can add col=, shape=, size= and the like
  • Can add geom=“jitter” (or whatever) to be even more explicit
  • Can add alpha=I(0.5) where the I() function means “inhibit”, preventing R from treating this as a new data point (somehow . . . )

Basically, the qplot() is nice for just a quick and dirty analysis, though it will have much less flexibility on a go-forward basis.

Example code for qplot includes:

# The old way (shown)
plot(mpg ~ wt, data = mtcars)

# Using ggplot:
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

# Using qplot:
qplot(wt, mpg, data=mtcars)

# Categorical:
# cyl
qplot(wt, mpg, data=mtcars, size=cyl)
## Warning: Using size for a discrete variable is not advised.

# gear
qplot(wt, mpg, data=mtcars, size=gear)

# Continuous
# hp
qplot(wt, mpg, data=mtcars, col=hp)

# qsec
qplot(wt, mpg, data=mtcars, col=qsec)

# qplot() with x only
qplot(factor(cyl), data=mtcars)

# qplot() with x and y
qplot(factor(cyl), factor(vs), data=mtcars)

# qplot() with geom set to jitter manually
qplot(factor(cyl), factor(vs), data=mtcars, geom="jitter")

# Make a dot plot with ggplot
ggplot(mtcars, aes(x=cyl, y=wt, fill = factor(am))) + 
    geom_dotplot(stackdir="center", binaxis="y")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

# qplot with geom "dotplot", binaxis = "y" and stackdir = "center"
qplot(cyl, wt, fill=factor(am), data=mtcars, geom="dotplot", binaxis="y", stackdir="center")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Course #1 wrap-up comments:

  • Start with “who is the intended audience” so thet form follows function
  • Good plots include Grammatical Elements and Aesthetic Mappings
    • This course focused on three Grammatical Elements (Data, Aestehtics, and Geometries)
    • The Aesthetic Mappings can be thought of as scales for encoding elements
  • Proper data formats and transformations make the appropriate information available for plotting
    • Recall the tidying for iris, iris.wide, and iris.tidy
    • The proper data depends on the plot that we want to create
  • The aesthetic choice depends on both the variable type (continuous vs. discrete) as well as the ease for the audience to decode the meaning
  • Common plot types include scatter, line, and bar
  • Recall the difference in the aesthetics and the attributes
  • Jitter or alpha can be good techniques for overcoming over-plotting (can use position= as an additional option)

A few warp-up coding exercises for ggplot #1 include:

# Check out the head of ChickWeight
data(ChickWeight)
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
# Use ggplot() for the second instruction
ggplot(ChickWeight, aes(x=Time, y=weight)) + geom_line(aes(group=Chick))

# Use ggplot() for the third instruction
ggplot(ChickWeight, aes(x=Time, y=weight, col=Diet)) + geom_line(aes(group=Chick))

# Use ggplot() for the last instruction
ggplot(ChickWeight, aes(x=Time, y=weight, col=Diet)) + geom_line(aes(group=Chick), alpha=0.3) + geom_smooth(lwd=2, se=FALSE)
## `geom_smooth()` using method = 'loess'

# Check out the structure of titanic
library(titanic)
## Warning: package 'titanic' was built under R version 3.2.5
library(dplyr)

titanicFull <- titanic::titanic_train
str(titanicFull)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
titanic <- titanicFull %>% 
    select(Pclass, Sex, Survived, Age) %>% 
    filter(complete.cases(.))
str(titanic)
## 'data.frame':    714 obs. of  4 variables:
##  $ Pclass  : int  3 1 3 1 3 1 3 3 2 3 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Survived: int  0 1 1 1 0 0 0 1 1 1 ...
##  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
# Use ggplot() for the first instruction
ggplot(titanic, aes(x=factor(Pclass), fill=factor(Sex))) + 
    geom_bar(position = "dodge")

# Use ggplot() for the second instruction
ggplot(titanic, aes(x=factor(Pclass), fill=factor(Sex))) + 
    geom_bar(position = "dodge") + 
    facet_grid(. ~ Survived)

# Position jitter (use below)
posn.j <- position_jitter(0.5, 0)

# Use ggplot() for the last instruction
ggplot(titanic, aes(x=factor(Pclass), y=Age, col=factor(Sex))) + 
    geom_jitter(size=3, alpha=0.5, position=posn.j) + 
    facet_grid(. ~ Survived)

Data Visualization (ggplot2 part 2)

The second course expands on the remaining layers of ggplot2: Statistics, Coordinates, Facets, and Themes.

The statistics layer for ggplot2 has two basic components:

  • Called from within a geom
  • Called from outside a geom (independently)
  • The stat functions all start with stat_ meaning that stat_bin() is the binning call for histograms
    • stat_bin() and geom_bar() are closely related
    • stat_smooth() and geom_smooth() are closely related – se is by defautl the 95% CI
      • The loess function is used for smallish (<1000) bins, with span= available as an option to change the window size
      • Can instead specifically call method=“lm” and the line will be based on OLS, with 95% CI defaulted in addition
      • Can also add the fullrange= call to request that the predictions be made for out-of-range data
  • The relationships between the stat_ and geom_ help to 1) explain warnings/errors, and 2) provide help pages for best debugging/tuning parameters

The statistics can also be called independently (outside the geom):

  • Note that Hmisc::smean.sdl(x, mult=1) will return mean, mean-1SD, mean+1SD as a vector
    • The default is mult=2, so Hmisc::smean() defaults to providing mean +/- 2SD
  • Within ggplot, this is called as mean_sdl(), with the first argument the vector, and the second argument the mult=
  • Can be called inside stat_summary(fun.data = mean_sdl, fun.args=list(mult=1))
  • Alternately, can be called as stat_summary(fun.y = mean, geom=“point”) + stat_summary(fun.data = mean_sdl, fun.args=list(mult=1), geom=“errorbar”, width=0.1) # will make a point mean with errorbars
  • Note that Hmisc::smean.cl.normal(x) will return mean, then 95% CI
    • Within ggplot, this is called as mean_cl_normal(x)
  • Can use anything inside stat_summary provided that the output is aligned with what the specified geom will require
  • The most common function calls are stat_summary(), stat_function(), and stat_qq()
  • Now back to the MASS::mammals dataset for brain vs body
    • Can add stat_function(fun = dnorm, color=“red”, arg=list(mean = mean(mam.new\(body), sd = sd(mam.new\)body)) and the normal density will plot on top of the graph
    • The geom_rug() can be helpful for putting the rug underneath the histogram

Example code from mtcars includes:

# Explore the mtcars data frame with str()
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# A scatter plot with LOESS smooth:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

# A scatter plot with an ordinary Least Squares linear model:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + 
  geom_smooth(method = "lm")

# The previous plot, without CI ribbon:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + 
  geom_smooth(method = "lm", se=FALSE)

# The previous plot, without points:
ggplot(mtcars, aes(x = wt, y = mpg)) + 
    geom_smooth(method = "lm", se=FALSE)

# Define cyl as a factor variable
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

# Complete the following ggplot command as instructed
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) + 
  stat_smooth(method = "lm", se = FALSE, aes(group=1))

# Plot 1: change the LOESS span
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add span below 
  geom_smooth(se = FALSE, span=0.7)
## `geom_smooth()` using method = 'loess'

# Plot 2: Set the overall model to LOESS and use a span of 0.7
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  # Change method and add span below
  stat_smooth(method = "loess", aes(group = 1), 
              se = FALSE, col = "black", span=0.7)

# Plot 3: Set col to "All", inside the aes layer of stat_smooth()
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  stat_smooth(method = "loess",
              # Add col inside aes()
              aes(group = 1, col="All"), 
              # Remove the col argument below
              se = FALSE, span = 0.7)

# Plot 4: Add scale_color_manual to change the colors
myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, span = 0.75) +
  stat_smooth(method = "loess", 
              aes(group = 1, col="All"), 
              se = F, span = 0.7) +
  # Add correct arguments to scale_color_manual
  scale_color_manual("Cylinders", values=myColors)

# Display structure of mtcars
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Convert cyl and am to factors:
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Define positions:
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width = 0.1, dodge.width = 0.2)
posn.j <- position_jitter(width = 0.2)

# base layers:
wt.cyl.am <- ggplot(mtcars, aes(x=cyl, y=wt, col=am, fill=am, group=am))

# Plot 1: Jittered, dodged scatter plot with transparent points
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6)

# Plot 2: Mean and SD - the easy way
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6) + 
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), position=posn.d)

# Plot 3: Mean and 95% CI - the easy way
wt.cyl.am +
  geom_point(position = posn.jd, alpha = 0.6) + 
  stat_summary(fun.data=mean_cl_normal, position=posn.d)

# Plot 4: Mean and SD - with T-tipped error bars - fill in ___
wt.cyl.am +
  stat_summary(geom = "point", fun.y = mean, 
               position = posn.d) +
  stat_summary(geom = "errorbar", fun.data = mean_sdl, 
               position = posn.d, fun.args = list(mult = 1), width = 0.1)

xx <- 1:100

# Function to save range for use in ggplot 
gg_range <- function(x) {
  # Change x below to return the instructed values
  data.frame(ymin = min(x), # Min
             ymax = max(x)
             ) # Max
}

gg_range(xx)
##   ymin ymax
## 1    1  100
# Required output:
#   ymin ymax
# 1    1  100

# Function to Custom function:
med_IQR <- function(x) {
  # Change x below to return the instructed values
  data.frame(y = median(x), # Median
             ymin = quantile(x, 0.25), # 1st quartile
             ymax = quantile(x, 0.75)
             )  # 3rd quartile
}

med_IQR(xx)
##        y  ymin  ymax
## 25% 50.5 25.75 75.25
# Required output:
#        y  ymin  ymax
# 25% 50.5 25.75 75.25

wt.cyl.am <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))

# Add three stat_summary calls to wt.cyl.am
wt.cyl.am + 
  stat_summary(geom = "linerange", fun.data = med_IQR, 
               position = posn.d, size = 3) +
  stat_summary(geom = "linerange", fun.data = gg_range, 
               position = posn.d, size = 3, 
               alpha = 0.4) +
  stat_summary(geom = "point", fun.y = median, 
               position = posn.d, size = 3, 
               col = "black", shape = "X")

Further examples (cached) from car::Vocab include:

Vocab <- car::Vocab

# Plot 1: Jittered scatter plot, add a linear model (lm) smooth:
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2) +
  stat_smooth(method="lm", se=FALSE)

# Plot 2: Only lm, colored by year
ggplot(Vocab, aes(x = education, y = vocabulary, col=factor(year))) +
  stat_smooth(method="lm", se=FALSE)

# Plot 3: Set a color brewer palette
ggplot(Vocab, aes(x = education, y = vocabulary, col=factor(year))) +
  stat_smooth(method="lm", se=FALSE) + 
  scale_color_brewer()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors

# Plot 4: Add the group, specify alpha and size
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group=factor(year))) +
  stat_smooth(method = "lm", se = FALSE, alpha=0.6, size=2) +
  scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))

# Use stat_quantile instead of stat_smooth:
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_quantile(alpha = 0.6, size = 2) +
  scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique

# Set quantile to 0.5:
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_quantile(alpha = 0.6, size = 2, quantiles=c(0.5)) +
  scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique

# Plot with linear and loess model
p <- ggplot(Vocab, aes(x = education, y = vocabulary)) +
       stat_smooth(method = "loess", aes(col = "red"), se = F) +
       stat_smooth(method = "lm", aes(col = "blue"), se = F) +
       scale_color_discrete("Model", labels = c("red" = "LOESS", "blue" = "lm"))

# Add stat_sum
p + stat_sum()

# Add stat_sum and set size range
p + stat_sum() + scale_size(range = c(1, 10))

The coordinates layers control the plot dimensions:

  • Typically using coord_, such as coord_cartesian()
    Examples for zooming in on a plot
    • scale_x_continuous(limits = )
    • xlim()
    • coord_cartesian(xlim = ) # has the benefit that there are no more warning messages; the whole plot is created, and then this specific zoom is selected
  • Aspect ratios should often, but not always, be on a 1:1 basis
    • coord_fixed(xNum) # the height will now be on a scale of xNum:1 vs. the width

The facets are based on the concept of small multiples as per the Tufte book on “Visulaization of Quantitative Information” (1983):

  • The concept is to split one large plot in to several smaller plots that each contain the same coordinate system
  • facet_grid(rows ~ columns), so facet_grid(. ~ Species) will mean make a single row, and use Species as the columns
    • The facet is just the splitting up of the data based on a factor variable

Example code includes:

data(mtcars); 
mtcars$cyl <- as.factor(mtcars$cyl); 
mtcars$am <- as.factor(mtcars$am)


# Basic ggplot() command, coded for you
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) + geom_point() + geom_smooth()

# Add scale_x_continuous
p + scale_x_continuous(limits = c(3,6), expand=c(0,0))
## `geom_smooth()` using method = 'loess'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.002
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.572
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)
## Warning: Removed 12 rows containing missing values (geom_point).

# The proper way to zoom in:
p + coord_cartesian(xlim=c(3, 6))
## `geom_smooth()` using method = 'loess'

data(iris)

# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) +
               geom_jitter() +
               geom_smooth(method = "lm", se = FALSE)

# Plot base.plot: default aspect ratio
base.plot

# Fix aspect ratio (1:1) of base.plot
base.plot + coord_equal()

# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x=1, fill=cyl)) +
              geom_bar()

# Convert thin.bar to pie chart
thin.bar + coord_polar(theta = "y")

# Create stacked bar plot: wide.bar
wide.bar <- ggplot(mtcars, aes(x=1, fill=cyl)) +
              geom_bar(width=1)

# Convert wide.bar to pie chart
wide.bar + coord_polar(theta="y")

# Basic scatter plot:
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Separate rows according to transmission type, am
p + facet_grid(am ~ .)

# Separate columns according to cylinders, cyl
p + facet_grid(. ~ cyl)

# Separate by both columns and rows 
p + facet_grid(am ~ cyl)

# Code to create the cyl_am col and myCol vector
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
               brewer.pal(9, "Reds")[c(3,6,8)])

# Basic scatter plot, add color scale:
ggplot(mtcars, aes(x = wt, y = mpg, col=cyl_am)) +
  geom_point() + scale_color_manual(values = myCol)

# Facet according on rows and columns.
ggplot(mtcars, aes(x = wt, y = mpg, col=cyl_am)) +
  geom_point() + scale_color_manual(values = myCol) + 
  facet_grid(gear ~ vs)

# Add more variables
ggplot(mtcars, aes(x = wt, y = mpg, col=cyl_am, size=disp)) +
  geom_point() + scale_color_manual(values = myCol) + 
  facet_grid(gear ~ vs)

mamsleep <- tidyr::gather(ggplot2::msleep %>% 
                              mutate(total = sleep_total, rem=sleep_rem) %>%
                              select(vore, name, total, rem) %>% 
                              filter(!is.na(total), !is.na(rem)), 
                          sleep, time, -c(vore, name))

mamsleep$sleep <- factor(mamsleep$sleep, levels=c("total", "rem"))
str(mamsleep)
## Classes 'tbl_df', 'tbl' and 'data.frame':    122 obs. of  4 variables:
##  $ vore : chr  "omni" "herbi" "omni" "herbi" ...
##  $ name : chr  "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" "Cow" ...
##  $ sleep: Factor w/ 2 levels "total","rem": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time : num  17 14.4 14.9 4 14.4 8.7 10.1 5.3 9.4 10 ...
# Basic scatter plot
ggplot(mamsleep, aes(x=time, y=name, col=sleep)) + geom_point()

# Facet rows accoding to vore
ggplot(mamsleep, aes(x=time, y=name, col=sleep)) + geom_point() + facet_grid(vore ~ .)

# Specify scale and space arguments to free up rows
ggplot(mamsleep, aes(x=time, y=name, col=sleep)) + geom_point() + 
    facet_grid(vore ~ ., scale="free_y", space="free_y")

The themes layer controls all the “non-data ink” on your plot:

  • This means the visual elements that are not part of the data; typically text, line, or rectangle
    • These are called as element_text(), element_line(), or element_rectangle()
    • The headers for the individual facets are referred to as “strip text”
  • Text is all the titles, axis labels, legend labels, and the like
  • Lines are the axis ticks, axis lines, panel grids, and the like
  • Rectangles include legend background/key, panel background/border, plot/strip background
  • Many of the attributes inherit from others
    • Everything in text inherits from text
    • Everything in line inherits from line
    • Everything in rect inherits from rect
    • Further, things like axis.text.x will inherit from axis
  • The element_blank() is for removing everything
    • theme(text = element_blank(), line=element_blank(), rect=element_blank()) # will just plot the points

Often an objective to repeat the theme multiple times in the same presentation:

  • Typical approach would be to save the theme layer as an object, then use it as needed
  • There are also some built-in themes available for use
    • theme_classic()
    • library(ggthemes); theme_tufte()
    • original <- theme_update() # everything in theme_update is modified, with the original values stored in original
  • Can also declare a theme_set() at the top of the code, and everything that follows will use that theme

Example code includes:

# Rough re-engineer of what is pre-defined as z
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)

myColors <- c(brewer.pal(9, "Blues"))[c(5, 7, 9)]

origZ <- ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + 
         geom_point(size=2) + 
         geom_smooth(method="lm", se=FALSE) + 
         facet_wrap(~ cyl, nrow=1) + 
         scale_x_continuous("Weight (lb/1000)") + 
         scale_y_continuous("Miles / (US) gallon") +
         scale_color_manual("Cylinders", values=myColors)
z <- origZ


# Plot 1: change the plot background color to myPink:
myPink <- "#FEE0D2"
z + theme(plot.background = element_rect(fill = myPink))

# Plot 2: adjust the border to be a black line of size 3
z + theme(plot.background = element_rect(fill = myPink, color="black", size=3))

# Plot 3: set panel.background, legend.key, legend.background and strip.background to element_blank()
uniform_panels <- theme(panel.background = element_blank(), 
                        legend.key = element_blank(), 
                        legend.background=element_blank(), 
                        strip.background = element_blank())
z <- z + theme(plot.background = element_rect(fill = myPink, color="black", size=3)) + uniform_panels
z

# Extend z with theme() function and three arguments
z <- z + theme(panel.grid=element_blank(), axis.line=element_line(color="black"),
               axis.ticks=element_line(color="black")
               )
z

# Extend z with theme() function and four arguments
myRed <- "#99000D"
z <- z + theme(strip.text = element_text(size=16, color=myRed), 
               axis.title.y=element_text(color=myRed, hjust=0, face="italic"),
               axis.title.x=element_text(color=myRed, hjust=0, face="italic"), 
               axis.text=element_text(color="black")
               )
z

# Move legend by position
z + theme(legend.position = c(0.85, 0.85))

# Change direction
z + theme(legend.direction = "horizontal")

# Change location by name
z + theme(legend.position = "bottom")

# Remove legend entirely
z + theme(legend.position = "none")

# Increase spacing between facets
library(grid)
z + theme(panel.margin.x = unit(2, "cm"))
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `panel.margin.x` is deprecated. Please use `panel.spacing.x`
## property instead

# Add code to remove any excess plot margin space
z + theme(panel.margin.x = unit(2, "cm"), plot.margin = unit(c(0,0,0,0), "cm"))
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

## Warning: `panel.margin.x` is deprecated. Please use `panel.spacing.x`
## property instead

# Make z2 the same as origZ
z2 <- origZ

# Theme layer saved as an object, theme_pink
theme_pink <- theme(panel.background = element_blank(),
                    legend.key = element_blank(),
                    legend.background = element_blank(),
                    strip.background = element_blank(),
                    plot.background = element_rect(fill = myPink, color = "black", size = 3),
                    panel.grid = element_blank(),
                    axis.line = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    strip.text = element_text(size = 16, color = myRed),
                    axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
                    axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
                    axis.text = element_text(color = "black"),
                    legend.position = "none")
  
# Apply theme_pink to z2
z2 + theme_pink

# Change code so that old theme is saved as old
old <- theme_update(panel.background = element_blank(),
             legend.key = element_blank(),
             legend.background = element_blank(),
             strip.background = element_blank(),
             plot.background = element_rect(fill = myPink, color = "black", size = 3),
             panel.grid = element_blank(),
             axis.line = element_line(color = "black"),
             axis.ticks = element_line(color = "black"),
             strip.text = element_text(size = 16, color = myRed),
             axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
             axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
             axis.text = element_text(color = "black"),
             legend.position = "none")

# Display the plot z2
z2

# Restore the old plot
theme_set(old)

# Load ggthemes package
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.2.5
# Apply theme_tufte
z2 + theme_tufte()

# Apply theme_tufte, modified:
z2 + theme_tufte() + 
    theme(legend.position = c(0.9, 0.9), legend.title=element_text(face="italic", size=12), axis.title=element_text(face="bold", size=14))

Best practices for plotting and visulaization:

  • The “dynamite plot” (bar plot with errorbars) is not recommended
    • Suggests that the data is normally distributed (even though that may not be the case)
    • Does not provide any information about the n per category
    • Bars also suggest that there is “data at 0”, which with many activities is patently untrue
    • Also no bars above the mean (where there is certainly data)
  • One alternative is to just plot the data points using a jitter/alpha so they can all be seen
    • The error bars with the jittered points are a much cleaner visulaization of the data

Recall that pie charts are just bar charts wrapped on to polar coordinates:

  • The pie chart is often prone to misues since people interpret it differently - angle vs. area vs. arc-length vs. etc.
  • Best practice is to reserve the pie chart for where there is a large, macro difference and only looking at ~3 groups
  • The better practice is the stacked bar that sums to 100%; no need to rotate it on to polar coordinates

Suggestion that heat maps are “one of the least effective forms of communication”:

  • Perceptions of colors can change based on the context
  • Tendency for the heat map to convey a “wow factor” even though there is precious little information communicated
  • Much better to facet and dot-plot the data on a continuous axis
    • It is very good for exploratory analysis; may be too dense for publication
  • Final choices will depend on the data density and the composition of the audience

Example code includes:

data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
m <- ggplot(mtcars, aes(x = cyl, y = wt))


# Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt, col = am, fill = am))

# Plot 1: Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

# Plot 2: Set position dodge in each stat function
m +
  stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), 
               geom = "errorbar", width = 0.1, position = "dodge")

# Set your dodge posn manually
posn.d <- position_dodge(0.9)

# Plot 3:  Redraw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", position = posn.d) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", 
               width = 0.1, position = posn.d
               )

cyl_means <- tapply(mtcars$wt, mtcars$cyl, FUN=mean)
cyl_sd <- tapply(mtcars$wt, mtcars$cyl, FUN=sd)
cyl_ct <- tapply(mtcars$wt, mtcars$cyl, FUN=length)
mtcars.cyl <- data.frame(cyl=as.factor(names(cyl_means)), 
                         wt.avg=cyl_means, sd=cyl_sd, prop=cyl_ct/sum(cyl_ct)
                         )


# Base layers
m <- ggplot(mtcars.cyl, aes(x = cyl, y = wt.avg))

# Plot 1: Draw bar plot
m + geom_bar(stat = "identity", fill="skyblue")

# Plot 2: Add width aesthetic
m + geom_bar(stat = "identity", fill="skyblue", aes(width=prop))
## Warning: Ignoring unknown aesthetics: width

# Plot 3: Add error bars
m + geom_bar(stat = "identity", fill="skyblue", aes(width=prop)) +
    geom_errorbar(aes(ymin = wt.avg - sd, ymax = wt.avg + sd), width=0.1)
## Warning: Ignoring unknown aesthetics: width

ggplot(mtcars, aes(x = factor(1), fill = am)) +
  geom_bar(position = "fill", width=1) + 
  facet_grid(. ~ cyl) + 
  coord_polar(theta="y")

library(GGally)
## Warning: package 'GGally' was built under R version 3.2.5
## Warning: replacing previous import by 'utils::capture.output' when loading
## 'GGally'
## Warning: replacing previous import by 'utils::head' when loading 'GGally'
## Warning: replacing previous import by 'utils::installed.packages' when
## loading 'GGally'
## Warning: replacing previous import by 'utils::str' when loading 'GGally'
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
# All columns except am
group_by_am <- match("am", names(mtcars))
my_names_am <- (1:11)[-group_by_am]

# Basic parallel plot - each variable plotted as a z-score transformation
ggparcoord(mtcars, my_names_am, groupColumn = group_by_am, alpha = 0.8)

barley <- lattice::barley
str(barley)
## 'data.frame':    120 obs. of  4 variables:
##  $ yield  : num  27 48.9 27.4 39.9 33 ...
##  $ variety: Factor w/ 10 levels "Svansota","No. 462",..: 3 3 3 3 3 3 7 7 7 7 ...
##  $ year   : Factor w/ 2 levels "1932","1931": 2 2 2 2 2 2 2 2 2 2 ...
##  $ site   : Factor w/ 6 levels "Grand Rapids",..: 3 6 4 5 1 2 3 6 4 5 ...
# Create color palette
myColors <- brewer.pal(9, "Reds")

# Build the heat map from scratch
ggplot(barley, aes(x=year, y=variety, fill=yield)) + 
    geom_tile() + 
    facet_wrap(~ site, ncol=1) + 
    scale_fill_gradientn(colors = myColors)

# Line plots
ggplot(barley, aes(x=year, y=yield, col=variety, group=variety)) + 
    geom_line() + 
    facet_wrap(~ site, nrow=1)

ggplot(barley, aes(x=year, y=yield, col=site, group=site, fill=site)) + 
    stat_summary(fun.y = mean, geom="line") + 
    stat_summary(fun.data = mean_sdl, fun.args=list(mult=1), geom="ribbon", col=NA, alpha=0.1)

CHIS (California Health Information Study) - Descriptive Statistics Case Study:

  • Data available at healthpolicy.ucla.edu
  • Data set “adult” from the 2009 version, starts with 47614 x 536
    • Reduced data set “adult” from the 2009 version will have 44346 x 10
      • RBMI, BMI_P, RACEHPR2, SRSEX, SRAGE_P, MARIT2, AB1, ASTCUR, AB51, POVLL
      • Filtered to ethnicities White, Black, Asian, Latino

The mosaic plot is a sequence of rectangles, each sized based on the size of the group:

  • X-axis is then the total number of individuals
  • Mosaic plots are a good means of looking at chi-squared data
  • Table can report on both the size and the underlying statistics
    • See for example the mosaic() command available in library vcd
    • Could then color each rectangle based on whether it has been over or under represented

Case study code - creating a Merimeko plot:

# TBD - need the raw data first!

Data Visualization (ggplot2 part 3)

Advanced course that builds on concepts from the previous modules:

  • Chapter 1 - graphs for the statistically savvy (box plots, density plots, violin plots, etc.)
  • Chapter 2 - graphs for very specific data types (large data, networks, etc.)
  • Chapter 3 - graphing with maps (cartography) and animations
  • Chapter 4 - internals of the ggplot2 package (grid, manipulations for efficiency)
  • Chapter 5 - case studies (building geoms from scratch; and a Tufte-plot)

Refresher code from previous modules includes:

# Create movies_small
library(ggplot2movies)
## Warning: package 'ggplot2movies' was built under R version 3.2.5
set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))

# Create movies_small
library(ggplot2movies)
set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))

# Explore movies_small with str()
str(movies_small)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  24 variables:
##  $ title      : chr  "Fair and Worm-er" "Shelf Life" "House: After Five Years of Living" "Three Long Years" ...
##  $ year       : int  1946 2000 1955 2003 1963 1992 1999 1972 1994 1985 ...
##  $ length     : int  7 4 11 76 103 107 87 84 127 94 ...
##  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rating     : Factor w/ 10 levels "1","2","3","4",..: 7 7 6 8 8 5 4 8 5 5 ...
##  $ votes      : int  16 11 15 11 103 28 105 9 37 28 ...
##  $ r1         : num  0 0 14.5 4.5 4.5 4.5 14.5 0 4.5 4.5 ...
##  $ r2         : num  0 0 0 0 4.5 0 4.5 0 4.5 0 ...
##  $ r3         : num  0 0 4.5 4.5 0 4.5 4.5 0 14.5 4.5 ...
##  $ r4         : num  0 0 4.5 0 4.5 4.5 4.5 0 4.5 14.5 ...
##  $ r5         : num  4.5 4.5 0 0 4.5 0 4.5 14.5 24.5 4.5 ...
##  $ r6         : num  4.5 24.5 34.5 4.5 4.5 0 14.5 0 4.5 14.5 ...
##  $ r7         : num  64.5 4.5 24.5 0 14.5 4.5 14.5 14.5 14.5 14.5 ...
##  $ r8         : num  14.5 24.5 4.5 4.5 14.5 24.5 14.5 24.5 14.5 14.5 ...
##  $ r9         : num  0 0 0 14.5 14.5 24.5 14.5 14.5 4.5 4.5 ...
##  $ r10        : num  14.5 24.5 14.5 44.5 44.5 24.5 14.5 44.5 4.5 24.5 ...
##  $ mpaa       : chr  "" "" "" "" ...
##  $ Action     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 1 1 1 0 0 ...
##  $ Drama      : int  0 0 0 0 1 0 0 0 1 1 ...
##  $ Documentary: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Short      : int  1 1 1 0 0 0 0 0 0 0 ...
# Build a scatter plot with mean and 95% CI
ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red") +
  scale_y_log10()

# Reproduce the plot
ggplot(diamonds, aes(x=carat, y=price, col=color)) + 
  geom_point(alpha=0.5, size=0.5, shape=16) + 
  scale_x_log10(limits=c(0.1, 10)) + 
  xlab(expression(log[10](Carat))) + 
  scale_y_log10(limits=c(100, 100000)) +  
  ylab(expression(log[10](Price))) + 
  scale_color_brewer(palette="YlOrRd") + 
  coord_equal() + 
  theme_classic()

# Add smooth layer and facet the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = .5, shape = 16) +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic() + 
  stat_smooth(method="lm") + 
  facet_grid(. ~ cut)

Box plots creation and usage (more for an academic audience). John Tukey - “Exploratory Data Analysis” (median, IQR, max, min, extremes outside of 1.5 * IQR). Whiskers are only drawn up to the “fence” (1.5 * IQR); everything above this is the outlier.

Example code includes:

# Add a boxplot geom
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

# Untransformed plot
d

# Transform the scale
d + scale_y_log10()

# Transform the coordinates (produces error in RStudio - commented out)
# d + coord_trans(y = "log10")

# Plot object p
p <- ggplot(diamonds, aes(x = carat, y = price))

# Use cut_interval
p + geom_boxplot(aes(group = cut_interval(carat, n=10)))

# Use cut_number
p + geom_boxplot(aes(group = cut_number(carat, n=10)))

# Use cut_width
p + geom_boxplot(aes(group = cut_width(carat, width=0.25)))

Density plots are a good way to describe univariate data. There are many statistics (e.g., PDF or Probability Density Function), including the “Kernel Density Estimate” (KDE) - “sum of bumps placed at the observations; kernel function determines the shape of the bumps, while window width, h, determines their width”:

  • x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
  • ggplot(data.frame(x=x), aes(x=x)) + geom_rug(col=“dark green”) + lims(x=c(-2, 6))
  • The normal KDE would build a normal around each point, then aggregate all the normals as the distribution
  • The mode of a continuous distribution is defined as the peak of the KDE
  • Generally, the defaults can be used for the bandwidth, h, though it can be adjusted if wanted
  • The plots can extend beyond the limits of the data, which is mathematically correct but may be off-putting
    • The geom_density() function will cut off at the observed maximum and minimum (the area under the curve will no longer be 1, though)

Example code includes:

# Set up keys
t_norm <- c(-0.560475646552213,-0.23017748948328,1.55870831414912,0.070508391424576,0.129287735160946,1.71506498688328,0.460916205989202,-1.26506123460653,-0.686852851893526,-0.445661970099958,1.22408179743946,0.359813827057364,0.400771450594052,0.11068271594512,-0.555841134754075,1.78691313680308,0.497850478229239,-1.96661715662964,0.701355901563686,-0.472791407727934,-1.06782370598685,-0.217974914658295,-1.02600444830724,-0.72889122929114,-0.625039267849257,-1.68669331074241,0.837787044494525,0.153373117836515,-1.13813693701195,1.25381492106993,0.426464221476814,-0.295071482992271,0.895125661045022,0.878133487533042,0.821581081637487,0.688640254100091,0.553917653537589,-0.0619117105767217,-0.305962663739917,-0.380471001012383,-0.694706978920513,-0.207917278019599,-1.26539635156826,2.16895596533851,1.20796199830499,-1.12310858320335,-0.402884835299076,-0.466655353623219,0.779965118336318,-0.0833690664718293,0.253318513994755,-0.028546755348703,-0.0428704572913161,1.36860228401446,-0.225770985659268,1.51647060442954,-1.54875280423022,0.584613749636069,0.123854243844614,0.215941568743973,0.379639482759882,-0.502323453109302,-0.33320738366942,-1.01857538310709,-1.07179122647558,0.303528641404258,0.448209778629426,0.0530042267305041,0.922267467879738,2.05008468562714,-0.491031166056535,-2.30916887564081,1.00573852446226,-0.709200762582393,-0.688008616467358,1.0255713696967,-0.284773007051009,-1.22071771225454,0.18130347974915,-0.138891362439045,0.00576418589988693,0.38528040112633,-0.370660031792409,0.644376548518833,-0.220486561818751,0.331781963915697,1.09683901314935,0.435181490833803,-0.325931585531227,1.14880761845109,0.993503855962119,0.54839695950807,0.238731735111441,-0.627906076039371,1.36065244853001,-0.600259587147127,2.18733299301658,1.53261062618519,-0.235700359100477,-1.02642090030678,-0.710406563699301,0.25688370915653,-0.246691878462374,-0.347542599397733,-0.951618567265016,-0.0450277248089203,-0.784904469457076,-1.66794193658814,-0.380226520287762,0.918996609060766,-0.575346962608392,0.607964322225033,-1.61788270828916,-0.0555619655245394,0.519407203943462,0.301153362166714,0.105676194148943,-0.640706008305376,-0.849704346033582,-1.02412879060491,0.117646597100126,-0.947474614184802,-0.490557443700668,-0.256092192198247,1.84386200523221,-0.651949901695459,0.235386572284857,0.0779608495637108,-0.961856634130129,-0.0713080861235987,1.44455085842335,0.451504053079215,0.0412329219929399,-0.422496832339625,-2.05324722154052,1.13133721341418,-1.46064007092482,0.739947510877334,1.90910356921748,-1.4438931609718,0.701784335374711,-0.262197489402468,-1.57214415914549,-1.51466765378175,-1.60153617357459,-0.530906522170303,-1.4617555849959,0.687916772975828,2.10010894052567,-1.28703047603518,0.787738847475178,0.76904224100091,0.332202578950118,-1.00837660827701,-0.119452606630659,-0.280395335170247,0.56298953322048,-0.372438756103829,0.976973386685621,-0.374580857767014,1.05271146557933,-1.04917700666607,-1.26015524475811,3.2410399349424,-0.416857588160432,0.298227591540715,0.636569674033849,-0.483780625708744,0.516862044313609,0.368964527385086,-0.215380507641693,0.0652930335253153,-0.034067253738464,2.12845189901618,-0.741336096272828,-1.09599626707466,0.0377883991710788,0.310480749443137,0.436523478910183,-0.458365332711106,-1.06332613397119,1.26318517608949,-0.349650387953555,-0.865512862653374,-0.236279568941097,-0.197175894348552,1.10992028971364,0.0847372921971965,0.754053785184521,-0.499292017172261,0.214445309581601,-0.324685911490835,0.0945835281735714,-0.895363357977542,-1.31080153332797,1.99721338474797,0.600708823672418,-1.25127136162494,-0.611165916680421,-1.18548008459731)
t_bimodal <- c(0.19881034888372,-0.68758702356649,-2.26514505669635,-1.45680594076791,-2.41433994791886,-2.47624689461558,-2.78860283785024,-2.59461726745951,-0.34909253266331,-2.05402812508544,-1.88075476357242,-1.75631257040091,-0.767524121514662,-2.51606383094478,-2.99250715039204,-0.32430306759681,-2.44116321690529,-2.72306596993987,-3.23627311888329,-3.2847157223178,-2.57397347929799,-1.38201418283347,-0.890151861070282,-1.29241164616441,-2.36365729709525,-1.9402500626154,-2.70459646368007,-2.71721816157401,-1.11534950102308,-3.01559257860354,-0.0447060345075361,-2.09031959396585,-1.78546117337078,-2.73852770473957,-2.57438868976327,-3.31701613230524,-2.18292538837273,-1.58101759507554,-1.67569565583862,-2.78153648705475,-2.788621970854,-2.50219871834286,-0.503939330153649,-3.13730362066574,-2.1790515943802,-0.0976381783210729,-2.10097488532881,-3.35984070382139,-2.66476943527406,-1.51454002109512,-2.37560287166977,-2.56187636354978,-2.34391723412846,-1.90950335286078,-0.401491228854174,-2.08856511213888,-0.919200503848483,-1.36924588434943,-2.11363989550614,-3.5329020028906,-2.52111731755252,-2.48987045313847,-1.95284556723847,-0.699801322333179,0.293078973831094,-0.452418941016231,-2.13315096432894,-3.75652739555764,-2.38877986407174,-1.91079277692671,-1.15498699593256,-1.03747203151573,-1.31569057058354,-3.39527434979947,-1.15035695436664,-2.44655721642722,-1.82519729983874,-1.92544882282627,-1.57183323502949,-1.97532501717386,-3.66747509758566,-1.26350403522656,-1.61397343165032,-2.26565162527822,-1.88185548895332,-1.86596135463154,-1.778980531439,-0.359153834022514,-2.21905037893348,-1.83193461611534,-0.831616126930907,-0.945818976623081,-0.854736889619643,-2.57746800105956,0.00248273029282942,-1.93329912906982,-0.133148155293138,-3.35090268603071,-1.97901641364576,-0.750085429030784,1.2847578127772,1.24731103178226,1.06146129639311,0.947486720661263,1.5628404668196,2.33117917295898,-0.01421049792072,2.21198043337229,3.23667504641657,4.03757401824044,3.30117599220059,2.75677476379596,0.27326960088567,1.39849329199322,1.64795354341738,2.70352390275689,1.89432866599623,0.741351371939828,3.68443570809411,2.91139129179596,2.23743027249103,3.21810861032581,0.661225712765028,2.6608202977898,1.47708762368658,2.68374552185071,1.93917804533993,2.63296071303145,3.33551761505939,2.0072900903169,3.01755863695209,0.811565964852021,1.27839555956398,3.51921771138818,2.37738797302393,-0.0522228204337298,0.635962547917625,1.79921898441088,2.86577940433449,1.89811674428478,2.62418747202065,2.95900537778783,3.67105482886294,2.05601673327496,1.9480180938191,0.24676264085773,2.09932759408783,1.42814994210444,1.02599041719591,1.82009376895246,3.01494317274366,0.00725151131141755,1.57272071279457,2.11663728358271,1.10679242994505,2.33390294249923,2.41142992061573,1.96696384072401,-0.465898193760027,4.57145814586664,1.7947007425318,2.65119328158767,2.27376649103655,3.02467323481835,2.81765944637409,1.79020682877149,2.37816777220851,1.05459116887611,2.85692301089932,1.53896166111565,4.41677335378821,0.348951104311813,1.53601275703399,2.82537986275924,2.51013254687866,1.410518961485,1.00321925779248,2.1444757047107,1.98569258683309,0.20971876273594,2.03455106713385,2.19023031569246,2.17472639698184,0.944982957397319,2.47613327830263,3.37857013695924,2.45623640317981,0.864411529625657,1.5643545303081,2.34610361955361,1.35295436868173,-0.157646335015277,2.88425082002821,1.17052238837548,1.42643972923172,3.50390060900454,1.22585507039459,2.8457315401893,0.739317121181225,1.64545759692602)
t_uniform <- c(-0.117272665724158,-0.536618106998503,-1.51491178292781,-1.81202527787536,-0.948814783245325,1.87456467188895,-0.0460180705413222,-0.0887118810787797,0.995171524584293,0.670560925267637,-1.80233761295676,0.780420977622271,-0.546953733079135,1.53653442952782,1.10118891857564,-1.44318543467671,-0.819962915033102,-1.49566885828972,0.359606495127082,0.246702435426414,0.754884480498731,-0.754916763864458,0.422347365878522,1.96413727570325,0.972819660790265,-1.69657147862017,-0.195324374362826,-1.78585226181895,-0.641777943819761,0.935808595269918,-1.98357223812491,1.08763792924583,-0.148099155165255,0.883361400105059,0.666022868826985,0.288294882513583,0.815251800231636,0.628884241916239,-0.842591419816017,-1.61104217730463,1.84968528430909,0.945336116477847,0.450893874280155,-1.52028447668999,0.201036185957491,-0.948974888771772,1.5934433247894,-1.96328021679074,-1.05506025627255,-1.47982161864638,-0.695067413151264,0.90559555310756,1.96698094159365,0.86053411103785,0.0177592439576983,-0.255809312686324,1.79530100245029,-1.51927404943854,-1.69941929355264,1.55608597118407,-0.302191619761288,-1.8305572848767,0.5897685745731,-0.125523350201547,0.471704493276775,-0.916738269850612,-1.3708188533783,-1.54298291914165,0.0307314526289701,0.192129150032997,-1.43741524219513,-1.32092379964888,1.0479412926361,0.1095797624439,1.44395743031055,0.69421995151788,-1.94783588126302,0.77279560547322,1.56685492862016,0.52740070130676,-1.57082138024271,1.684260815382,0.701449524611235,-1.40562514960766,0.981367369182408,1.77039544750005,-0.316866497509181,-0.809107687324286,-0.962293319404125,-1.10847473237664,0.262617373839021,1.02660053130239,0.678414183668792,0.186091177165508,1.24585463106632,1.03666720818728,-1.9197261352092,-0.476385199464858,-1.79647954553366,1.19161174912006,1.69479685276747,0.170393481850624,1.40945840068161,0.33425145316869,0.673294574022293,0.0452583860605955,1.05100235715508,1.61344915162772,1.28189804870635,-1.71427260152996,-1.98441462777555,-1.79120553657413,1.46624071896076,0.304980675689876,-0.744629711844027,1.83786313515157,0.364775028079748,0.125637343153358,-0.464253313839436,-0.721787082031369,1.23354502115399,-1.83232200611383,-0.545029221102595,1.4263878678903,0.791786313988268,0.737945891916752,-0.607939789071679,0.218727317638695,-1.45102552976459,1.13972622528672,1.54745028633624,-1.18361646682024,1.08249184582382,0.385451843030751,1.83067896962166,-1.36524640955031,0.103897096589208,1.49260547012091,1.47882428113371,-1.90524542704225,1.90355877391994,-0.0391032071784139,-0.443318707868457,-0.329780160449445,-1.62829671800137,-1.35276315920055,-0.378334019333124,-0.632742223329842,-0.33897018712014,-0.783790123648942,0.241122149862349,-1.37650339771062,1.82631905656308,-1.82413349859416,-0.511369029060006,1.85046136565506,0.581710062921047,-1.75494954269379,-0.360216286033392,-0.296379463747144,0.0326323388144374,-0.201600356958807,0.493045534007251,-1.44008827582002,1.63178560324013,0.277773099020123,0.193122295662761,-1.53268950991333,1.04811332933605,-0.0865222131833434,1.12787565868348,-1.8158938055858,1.27937867119908,-0.922366210259497,-0.868598341941833,0.572886237874627,1.79247535113245,-1.97200618218631,-0.593529311940074,-0.323817100375891,-0.168492811731994,0.846770317293704,1.67939223069698,0.508442802354693,1.60872711334378,1.02931663673371,-1.44856653735042,-1.38698048796505,-1.2347512524575,-0.267259437590837,-1.65112279262394,-1.10487999580801,0.28859471809119,-0.399323286488652,0.261861522682011,1.31849551480263,0.568455277942121,-0.43400499317795,0.838319418951869,-1.56470370758325)
test_data <- data.frame(norm=t_norm, bimodal=t_bimodal, uniform=t_uniform)
small_data <- data.frame(x=c(-3.5, 0, 0.5, 6))

# Calculating density: d
d <- density(test_data$norm)

# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]

# Finish the ggplot call
ggplot(test_data, aes(x = norm)) +
  geom_rug() +
  geom_density() +
  geom_vline(xintercept = mode, col = "red")

# Arguments you'll need later on
fun_args <- list(mean = mean(test_data$norm), sd = sd(test_data$norm))

# Finish the ggplot
ggplot(test_data, aes(x = norm)) + 
  geom_histogram(aes(y=..density..)) + 
  geom_density(col="red") + 
  stat_function(fun=dnorm, args=fun_args, col="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Get the bandwith
get_bw <- density(small_data$x)$bw

# Basic plotting object
p <- ggplot(small_data, aes(x = x)) +
  geom_rug() +
  coord_cartesian(ylim = c(0,0.5))

# Create three plots
p + geom_density()

p + geom_density(adjust=0.25)

# p + geom_density(bw = 0.25*get_bw)  ** does not work with my version of R/ggplot2 **

# Create two plots
p + geom_density(kernel="r")

p + geom_density(kernel="e")

There are multiple options for handling multiple groups (levels within a factor) or variables:

  • One option for box plots is to use geom_boxplot(varwidth=TRUE)
  • The violin plot is also nice for this - geom_violin(aes(weight=), col=NA) # to use default colors and weight by a given variable
  • The 2D density plot can help show bimodal data - geom_density_2d()
    • Alternately, can use library(viridis); data(faithful)
    • ggplot(faithful, aes(x=waiting, y=eruptions)) + stat_density_2d(geom = “tile”, aes(fill = ..density..), contour=FALSE) + scale_fill_viridis()

Example code includes:

# Create the data
norm0 <- c(-0.56, -0.23, 1.56, 0.07, 0.13, 1.72, 0.46, -1.27, -0.69, -0.45, 1.22, 0.36, 0.4, 0.11, -0.56, 1.79, 0.5, -1.97, 0.7, -0.47, -1.07, -0.22, -1.03, -0.73, -0.63, -1.69, 0.84, 0.15, -1.14, 1.25, 0.43, -0.3, 0.9, 0.88, 0.82, 0.69, 0.55, -0.06, -0.31, -0.38, -0.69, -0.21, -1.27, 2.17, 1.21, -1.12, -0.4, -0.47, 0.78, -0.08, 0.25, -0.03, -0.04, 1.37, -0.23, 1.52, -1.55, 0.58, 0.12, 0.22, 0.38, -0.5, -0.33, -1.02, -1.07, 0.3, 0.45, 0.05, 0.92, 2.05, -0.49, -2.31, 1.01, -0.71, -0.69, 1.03, -0.28, -1.22, 0.18, -0.14, 0.01, 0.39, -0.37, 0.64, -0.22, 0.33, 1.1, 0.44, -0.33, 1.15, 0.99, 0.55, 0.24, -0.63, 1.36, -0.6, 2.19, 1.53, -0.24, -1.03, -0.71, 0.26, -0.25, -0.35, -0.95, -0.05, -0.78, -1.67, -0.38, 0.92, -0.58, 0.61, -1.62, -0.06, 0.52, 0.3, 0.11, -0.64, -0.85, -1.02, 0.12, -0.95, -0.49, -0.26, 1.84, -0.65, 0.24, 0.08, -0.96, -0.07, 1.44, 0.45, 0.04, -0.42, -2.05, 1.13, -1.46, 0.74, 1.91, -1.44, 0.7, -0.26, -1.57, -1.51, -1.6, -0.53, -1.46, 0.69, 2.1, -1.29, 0.79, 0.77, 0.33, -1.01, -0.12, -0.28, 0.56, -0.37, 0.98, -0.37, 1.05, -1.05, -1.26, 3.24, -0.42, 0.3, 0.64, -0.48, 0.52, 0.37, -0.22, 0.07, -0.03, 2.13, -0.74, -1.1, 0.04, 0.31, 0.44, -0.46, -1.06, 1.26, -0.35, -0.87, -0.24, -0.2, 1.11, 0.08, 0.75, -0.5, 0.21, -0.32, 0.09, -0.9, -1.31, 2, 0.6, -1.25, -0.61, -1.19)
bimodal0 <- c(0.2, -0.69, -2.27, -1.46, -2.41, -2.48, -2.79, -2.59, -0.35, -2.05, -1.88, -1.76, -0.77, -2.52, -2.99, -0.32, -2.44, -2.72, -3.24, -3.28, -2.57, -1.38, -0.89, -1.29, -2.36, -1.94, -2.7, -2.72, -1.12, -3.02, -0.04, -2.09, -1.79, -2.74, -2.57, -3.32, -2.18, -1.58, -1.68, -2.78, -2.79, -2.5, -0.5, -3.14, -2.18, -0.1, -2.1, -3.36, -2.66, -1.51, -2.38, -2.56, -2.34, -1.91, -0.4, -2.09, -0.92, -1.37, -2.11, -3.53, -2.52, -2.49, -1.95, -0.7, 0.29, -0.45, -2.13, -3.76, -2.39, -1.91, -1.15, -1.04, -1.32, -3.4, -1.15, -2.45, -1.83, -1.93, -1.57, -1.98, -3.67, -1.26, -1.61, -2.27, -1.88, -1.87, -1.78, -0.36, -2.22, -1.83, -0.83, -0.95, -0.85, -2.58, 0, -1.93, -0.13, -3.35, -1.98, -0.75, 1.28, 1.25, 1.06, 0.95, 1.56, 2.33, -0.01, 2.21, 3.24, 4.04, 3.3, 2.76, 0.27, 1.4, 1.65, 2.7, 1.89, 0.74, 3.68, 2.91, 2.24, 3.22, 0.66, 2.66, 1.48, 2.68, 1.94, 2.63, 3.34, 2.01, 3.02, 0.81, 1.28, 3.52, 2.38, -0.05, 0.64, 1.8, 2.87, 1.9, 2.62, 2.96, 3.67, 2.06, 1.95, 0.25, 2.1, 1.43, 1.03, 1.82, 3.01, 0.01, 1.57, 2.12, 1.11, 2.33, 2.41, 1.97, -0.47, 4.57, 1.79, 2.65, 2.27, 3.02, 2.82, 1.79, 2.38, 1.05, 2.86, 1.54, 4.42, 0.35, 1.54, 2.83, 2.51, 1.41, 1, 2.14, 1.99, 0.21, 2.03, 2.19, 2.17, 0.94, 2.48, 3.38, 2.46, 0.86, 1.56, 2.35, 1.35, -0.16, 2.88, 1.17, 1.43, 3.5, 1.23, 2.85, 0.74, 1.65)
value2 <- c(-0.56, -0.23, 1.559, 0.071, 0.129, 1.715, 0.461, -1.265, -0.687, -0.446, 1.224, 0.36, 0.401, 0.111, -0.556, 1.787, 0.498, -1.967, 0.701, -0.473, -1.068, -0.218, -1.026, -0.729, -0.625, -1.687, 0.838, 0.153, -1.138, 1.254, 0.426, -0.295, 0.895, 0.878, 0.822, 0.689, 0.554, -0.062, -0.306, -0.38, -0.695, -0.208, -1.265, 2.169, 1.208, -1.123, -0.403, -0.467, 0.78, -0.083, 0.253, -0.029, -0.043, 1.369, -0.226, 1.516, -1.549, 0.585, 0.124, 0.216, 0.38, -0.502, -0.333, -1.019, -1.072, 0.304, 0.448, 0.053, 0.922, 2.05, -0.491, -2.309, 1.006, -0.709, -0.688, 1.026, -0.285, -1.221, 0.181, -0.139, 0.006, 0.385, -0.371, 0.644, -0.22, 0.332, 1.097, 0.435, -0.326, 1.149, 0.994, 0.548, 0.239, -0.628, 1.361, -0.6, 2.187, 1.533, -0.236, -1.026, -0.71, 0.257, -0.247, -0.348, -0.952, -0.045, -0.785, -1.668, -0.38, 0.919, -0.575, 0.608, -1.618, -0.056, 0.519, 0.301, 0.106, -0.641, -0.85, -1.024, 0.118, -0.947, -0.491, -0.256, 1.844, -0.652, 0.235, 0.078, -0.962, -0.071, 1.445, 0.452, 0.041, -0.422, -2.053, 1.131, -1.461, 0.74, 1.909, -1.444, 0.702, -0.262, -1.572, -1.515, -1.602, -0.531, -1.462, 0.688, 2.1, -1.287, 0.788, 0.769, 0.332, -1.008, -0.119, -0.28, 0.563, -0.372, 0.977, -0.375, 1.053, -1.049, -1.26, 3.241, -0.417, 0.298, 0.637, -0.484, 0.517, 0.369, -0.215, 0.065, -0.034, 2.128, -0.741, -1.096, 0.038, 0.31, 0.437, -0.458, -1.063, 1.263, -0.35, -0.866, -0.236, -0.197, 1.11, 0.085, 0.754, -0.499, 0.214, -0.325, 0.095, -0.895, -1.311, 1.997, 0.601, -1.251, -0.611, -1.185, 0.199, -0.688, -2.265, -1.457, -2.414, -2.476, -2.789, -2.595, -0.349, -2.054, -1.881, -1.756, -0.768, -2.516, -2.993, -0.324, -2.441, -2.723, -3.236, -3.285, -2.574, -1.382, -0.89, -1.292, -2.364, -1.94, -2.705, -2.717, -1.115, -3.016, -0.045, -2.09, -1.785, -2.739, -2.574, -3.317, -2.183, -1.581, -1.676, -2.782, -2.789, -2.502, -0.504, -3.137, -2.179, -0.098, -2.101, -3.36, -2.665, -1.515, -2.376, -2.562, -2.344, -1.91, -0.401, -2.089, -0.919, -1.369, -2.114, -3.533, -2.521, -2.49, -1.953, -0.7, 0.293, -0.452, -2.133, -3.757, -2.389, -1.911, -1.155, -1.037, -1.316, -3.395, -1.15, -2.447, -1.825, -1.925, -1.572, -1.975, -3.667, -1.264, -1.614, -2.266, -1.882, -1.866, -1.779, -0.359, -2.219, -1.832, -0.832, -0.946, -0.855, -2.577, 0.002, -1.933, -0.133, -3.351, -1.979, -0.75, 1.285, 1.247, 1.061, 0.947, 1.563, 2.331, -0.014, 2.212, 3.237, 4.038, 3.301, 2.757, 0.273, 1.398, 1.648, 2.704, 1.894, 0.741, 3.684, 2.911, 2.237, 3.218, 0.661, 2.661, 1.477, 2.684, 1.939, 2.633, 3.336, 2.007, 3.018, 0.812, 1.278, 3.519, 2.377, -0.052, 0.636, 1.799, 2.866, 1.898, 2.624, 2.959, 3.671, 2.056, 1.948, 0.247, 2.099, 1.428, 1.026, 1.82, 3.015, 0.007, 1.573, 2.117, 1.107, 2.334, 2.411, 1.967, -0.466, 4.571, 1.795, 2.651, 2.274, 3.025, 2.818, 1.79, 2.378, 1.055, 2.857, 1.539, 4.417, 0.349, 1.536, 2.825, 2.51, 1.411, 1.003, 2.144, 1.986, 0.21, 2.035, 2.19, 2.175, 0.945, 2.476, 3.379, 2.456, 0.864, 1.564, 2.346, 1.353, -0.158, 2.884, 1.171, 1.426, 3.504, 1.226, 2.846, 0.739, 1.645)
dist2 <- c(1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2)
test_data <- data.frame(norm=norm0, bimodal=bimodal0)
test_data2 <- data.frame(dist=factor(dist2, labels=c("norm", "bimodal")), value=value2)

str(test_data)
## 'data.frame':    200 obs. of  2 variables:
##  $ norm   : num  -0.56 -0.23 1.56 0.07 0.13 1.72 0.46 -1.27 -0.69 -0.45 ...
##  $ bimodal: num  0.2 -0.69 -2.27 -1.46 -2.41 -2.48 -2.79 -2.59 -0.35 -2.05 ...
str(test_data2)
## 'data.frame':    400 obs. of  2 variables:
##  $ dist : Factor w/ 2 levels "norm","bimodal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ value: num  -0.56 -0.23 1.559 0.071 0.129 ...
# Plot with test_data
ggplot(test_data, aes(x = norm)) +
  geom_rug() + 
  geom_density()

# Plot two distributions with test_data2
ggplot(test_data2, aes(x = value, fill = dist, col = dist)) +
  geom_rug(alpha=0.6) + 
  geom_density(alpha=0.6)

data(msleep); mammals <- msleep[!is.na(msleep$vore), c("vore", "sleep_total")]

# Individual densities
ggplot(mammals[mammals$vore == "insecti", ], aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# With faceting
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3)) +
  facet_wrap(~ vore, nrow=2)

# Note that by default, the x ranges fill the scale
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Trim each density plot individually
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35, trim=TRUE) +
  scale_x_continuous(limits=c(0,24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Density plot from before
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Finish the dplyr command
library(dplyr)
mammals2 <- mammals %>%
  group_by(vore) %>%
  mutate(n=n()/nrow(mammals))

# Density plot, weighted
ggplot(mammals2, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col=NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))
## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

# Violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin()

# Violin plot, weighted
ggplot(mammals2, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin(aes(weight = n), col=NA)
## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

## Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
## = kernel, : sum(weights) != 1 -- will not get true density

data(faithful)

# Base layers
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
  coord_fixed(60 / 4.5)

# Use geom_density_2d()
p + geom_density_2d()

# Use stat_density_2d()
p + stat_density_2d(h=c(5, 0.5), aes(col=..level..))

# Load in the viridis package
library(viridis)
## Warning: package 'viridis' was built under R version 3.2.5
# Load in the viridis package
library(viridis)

# Add viridis color scale
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0,0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0,0)) +
  coord_fixed(60/4.5) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), h=c(5,.5), contour = FALSE) +
  scale_fill_viridis()

Defining largeness of a dataset - # observations, # variables, etc. For many observations (e.g., “diamonds” dataset) - can adjust point size, alpha blending, 2-d contours, etc.:

  • Alternately, can bin the values in a 2D manner - geom_bin2d(bins=)
  • Can also use geom_hex(bins=) for hex_binning - uses hexagons rather than squares

For many variables, can use pre-processing such as PCA:

  • Can plot using facets for factor variables, but this is only good for small numbers of facets
  • SPLOM - scatterplot matrix, e.g., pairs()
  • library(PerformanceAnalytics); chart.Correlation(iris[-5])
  • library(Ggally); ggpairs(mtcars)

Ternary plot - triangle plot where all the components add to 100% (e.g., soil composition):

  • Soil has a sand, silt, and clay component
  • Each variable of interest forms a side of a triangle

Network plots (e.g., blood type donors):

  • Which blood types can be donors for other types?
  • The network is called by geom_net()

Diagnostic plots:

  • Can plot OLS models and qualities of fit
  • Calling plot() on an lm result provides four plots
  • Can use car::qqplot() or car::influencePlot()

Example code includes:

# pairs
data(iris)
pairs(iris[, 1:4])

# chart.Correlation
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 3.2.5
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(iris[, 1:4])
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "method" is
## not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "method" is not a
## graphical parameter
## Warning in plot.window(...): "method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "method" is not a graphical parameter
## Warning in title(...): "method" is not a graphical parameter

# ggpairs
library(GGally)
data(mtcars); mtcars_fact <- mtcars; mtcars_fact$cyl <- as.factor(mtcars_fact$cyl)
ggpairs(mtcars_fact[, 1:3])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
cor_list <- function(x) {
  L <- M <- cor(x)
  
  M[lower.tri(M, diag = TRUE)] <- NA
  M <- melt(M)
  names(M)[3] <- "points"
  
  L[upper.tri(L, diag = TRUE)] <- NA
  L <- melt(L)
  names(L)[3] <- "labels"
  
  merge(M, L)
}

# Calculate xx with cor_list
library(dplyr)
xx <- iris %>%
  group_by(Species) %>%
  do(cor_list(.[1:4])) 

# Finish the plot
ggplot(xx, aes(x = Var1, y = Var2)) +
  geom_point(aes(col = points, size = abs(points)), shape = 16) +
  geom_text(aes(col = labels,  size = labels, label = round(labels, 2))) +
  scale_size(range = c(0, 6)) +
  scale_color_gradient("r", limits = c(-1, 1)) +
  scale_y_discrete("", limits = rev(levels(xx$Var1))) +
  scale_x_discrete("") +
  guides(size = FALSE) +
  geom_abline(slope = -1, intercept = nlevels(xx$Var1) + 1) +
  coord_fixed() +
  facet_grid(. ~ Species) +
  theme(axis.text.y = element_text(angle = 45, hjust = 1),
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_blank())
## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 30 rows containing missing values (geom_text).

# Explore africa
library(GSIF)
## Warning: package 'GSIF' was built under R version 3.2.5
## GSIF version 0.5-3 (2016-07-18)
## URL: http://gsif.r-forge.r-project.org/
data(afsp)
str(afsp)
## List of 2
##  $ sites   :'data.frame':    26270 obs. of  9 variables:
##   ..$ SOURCEID: Factor w/ 26270 levels "100902","100903",..: 5606 5607 5608 5609 5610 5611 5604 5605 5681 5682 ...
##   ..$ SOURCEDB: Factor w/ 7 levels "Af_LDSF","Af_soilspec",..: 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ LONWGS84: num [1:26270] 36.1 36.1 36.1 36.1 36.1 ...
##   ..$ LATWGS84: num [1:26270] -6.93 -6.93 -6.93 -6.93 -6.93 ...
##   ..$ TIMESTRR: Date[1:26270], format: "2012-10-02" ...
##   ..$ TAXGWRB : Factor w/ 40 levels "AC","AL","Alisol",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..$ TAXNUSDA: Factor w/ 960 levels "\"Plinthic\" Udoxic Dystropept",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..$ BDRICM  : num [1:26270] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ DRAINFAO: Factor w/ 7 levels "E","I","M","P",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ horizons:'data.frame':    87693 obs. of  14 variables:
##   ..$ SOURCEID: Factor w/ 26270 levels "100902","100903",..: 5606 5606 5607 5607 5607 5607 5608 5608 5609 5609 ...
##   ..$ UHDICM  : num [1:87693] 0 20 0 20 0 20 0 20 0 20 ...
##   ..$ LHDICM  : num [1:87693] 20 50 20 50 20 50 20 50 20 50 ...
##   ..$ MCOMNS  : Factor w/ 289 levels "10BG4/1","10R2.5/1",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..$ ORCDRC  : num [1:87693] 4.36 3.69 3.67 4 5.25 4.57 6.02 3.6 3.33 5.33 ...
##   ..$ PHIHOX  : num [1:87693] 7.23 7.3 7.06 6.95 7.21 7.25 7.35 7.44 7.26 6.94 ...
##   ..$ SNDPPT  : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ SLTPPT  : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ CLYPPT  : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ CRFVOL  : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ BLD     : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ CEC     : num [1:87693] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ NTO     : num [1:87693] 0.381 0.334 0.318 0.366 0.439 0.404 0.512 0.334 0.329 0.482 ...
##   ..$ EMGX    : num [1:87693] 2.66 2.95 2.73 3.84 2.36 3.16 3.59 4.37 3.71 2.76 ...
africa <- afsp$horizons[, c("SNDPPT", "SLTPPT", "CLYPPT")]
africa <- africa[complete.cases(africa), ]
africa <- africa / rowSums(africa)
africa <- africa %>% rename(Sand=SNDPPT, Silt=SLTPPT, Clay=CLYPPT)
str(africa)
## 'data.frame':    53830 obs. of  3 variables:
##  $ Sand: num  0.845 0.838 0.694 0.661 0.647 ...
##  $ Silt: num  0.0326 0.0313 0.0387 0.0782 0.1176 ...
##  $ Clay: num  0.122 0.131 0.268 0.261 0.235 ...
# Sample the dataset
africa_sample <- africa[sample(1:nrow(africa), size = 50),]

# Add an ID column from the row.names
africa_sample$ID <- row.names(africa_sample)

# Gather africa_sample
library(tidyr)
africa_sample_tidy <- gather(africa_sample, key, value, -ID)

# Finish the ggplot command
ggplot(africa_sample_tidy, aes(x = factor(ID), y = value, fill = key)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_x_discrete(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  labs(x = "Location", y = "Composition", fill = "Component") +
  theme_minimal()

# Load ggtern
library(ggtern)
## Warning: package 'ggtern' was built under R version 3.2.5
## --
## Consider donating at: http://ggtern.com
## Even small amounts (say $10-50) are very much appreciated!
## Remember to cite, run citation(package = 'ggtern') for further info.
## --
## 
## Attaching package: 'ggtern'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, aes, annotate, calc_element, ggplot, ggplot_build,
##     ggplot_gtable, ggplotGrob, ggsave, layer_data, theme,
##     theme_bw, theme_classic, theme_dark, theme_gray, theme_light,
##     theme_linedraw, theme_minimal, theme_void
# Build ternary plot
# ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
#   geom_point(shape=16, alpha=0.2)


# ggtern and ggplot2 are loaded

# Plot 1
# ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
#   geom_density_tern()

# Plot 2
# ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
#  stat_density_tern(geom="polygon", aes(fill = ..level.., alpha = ..level..), 
#                    guides(fill = guide_legend(show = FALSE))
#                    )


# Load geomnet
library(geomnet)
## Warning: package 'geomnet' was built under R version 3.2.5
# Examine structure of madmen
str(geomnet::madmen)
## List of 2
##  $ edges   :'data.frame':    39 obs. of  2 variables:
##   ..$ Name1: Factor w/ 9 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
##   ..$ Name2: Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
##  $ vertices:'data.frame':    45 obs. of  2 variables:
##   ..$ label : Factor w/ 45 levels "Abe Drexler",..: 5 9 16 23 26 32 33 38 39 17 ...
##   ..$ Gender: Factor w/ 2 levels "female","male": 1 2 2 1 2 1 2 2 2 2 ...
# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y="label",
               all = TRUE)

# Examin structure of mmnet
str(mmnet)
## 'data.frame':    75 obs. of  3 variables:
##  $ Name1 : Factor w/ 45 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ Name2 : Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
##  $ Gender: Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 2 ...
# geomnet is pre-loaded

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Finish the ggplot command
# ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
#   geom_net(aes(col=Gender), size=6, linewidth=1, label=TRUE, fontsize=3, labelcolour="black")

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Tweak the network plot
# ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
#   geom_net(aes(col = Gender),
#            size = 6,
#            linewidth = 1,
#            label = TRUE,
#            fontsize = 3,
#            labelcolour = "black",
#            directed=TRUE) +
#   scale_color_manual(values = c("#FF69B4", "#0099ff")) +
#   xlim(c(-0.05, 1.05)) +
#   ggmap::theme_nothing(legend=TRUE) +
#   theme(legend.key=element_blank())


# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)


# Create linear model: res
data(trees)
res <- lm(Volume ~ Girth, data = trees)

# Plot res
par(mfrow = c(2, 2))
plot(res)

par(mfrow = c(1, 1))

# Import ggfortify and use autoplot()
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 3.2.5
autoplot(res, ncol=2)

# Inspect structure of Canada
library(vars); data(Canada)
## Warning: package 'vars' was built under R version 3.2.5
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     mammals
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.2.4
## Loading required package: sandwich
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.2.5
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.2.5
str(Canada)
##  mts [1:84, 1:4] 930 930 930 931 933 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "e" "prod" "rw" "U"
##  - attr(*, "tsp")= num [1:3] 1980 2001 4
##  - attr(*, "class")= chr [1:2] "mts" "ts"
# Call plot() on Canada
plot(Canada)

# Call autoplot() on Canada
# autoplot(Canada)


# ggfortify and eurodist are available

# Autoplot + ggplot2 tweaking
# autoplot(eurodist) + 
#   labs(x="", y="") +
#   coord_fixed() + 
#   theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Autoplot of MDS
# autoplot(cmdscale(eurodist, eig=TRUE), label=TRUE, label.size=3, size=0)


# perform clustering
iris_k <- kmeans(iris[-5], centers=3)

# autplot: coloring according to cluster
# autoplot(iris_k, data=iris, frame=TRUE)

# autoplot: coloring according to species
# autoplot(iris_k, data=iris, frame=TRUE, col="Species")

Choropleths are a series of polygons (or points or lines), and are useful when you have the shapes file and some underlying data. Maps (e.g., full US or by state or etc.) are a primary example - can put right in to ggplot2:

  • The map is not always the best - Cleveland dot plot can be superior (can pick a more sensible order or better see central tendencies)
  • Cleveland dot plots can also be easier for a sub-classification (such as region)

Cartographic maps are an alternative (topographical, photographic, etc.):

  • The ggmap::get_map() is a good way to pull down maps
  • The ggmap::geocode() is a good way to pull down the lat/lon for given landmarks (or the like)
  • The ggmap::make_bbox() is a good way to set a boundary based on some points, as well as a desired buffer

Animations can be useful for dense, temporal data or as an exploratory tool:

  • Motion Chart (Hans Rosling) showing UN data by geography over time
  • The frame aesthetic allows for animation, and is used inside the aes() command
  • The gg_animate() command is called on a ggplot() object, with interval=x to define how quickly to animate

Example code includes:

library(ggplot2)
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.5
library(ggthemes)
library(maps)
## Warning: package 'maps' was built under R version 3.2.5
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(viridis)


# Basic map of the USA
usa <- ggplot2::map_data("usa")
str(usa)
## 'data.frame':    7243 obs. of  6 variables:
##  $ long     : num  -101 -101 -101 -101 -101 ...
##  $ lat      : num  29.7 29.7 29.7 29.6 29.6 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "main" "main" "main" "main" ...
##  $ subregion: chr  NA NA NA NA ...
ggplot(data=usa, aes(x=long, y=lat, group=group)) + 
    geom_polygon() + 
    coord_map()

# Add USA cities to the USA map (continental US only, cities with 250k+ population)
library(dplyr)
data(us.cities)
cities <- us.cities %>% 
    mutate(City=name, State=country.etc, Pop_est=pop) %>% 
    dplyr::select(City, State, Pop_est, lat, long) %>%
    filter(!(State %in% c("AK", "HI", "ma")) & Pop_est >= 250000)

ggplot(usa, aes(x = long, y = lat, group = group)) + 
    geom_polygon() + 
    geom_point(data=cities, aes(group=State, size=Pop_est, col="red"), shape=16, alpha=0.6) + 
    coord_map() + 
    theme_map()

# Arrange cities
cities_arr <- arrange(cities, Pop_est)

# Create US plot of cities, colored by the viridis theme
ggplot(usa, aes(x = long, y = lat, group = group)) + 
    geom_polygon(fill="grey90") + 
    geom_point(data=cities_arr, aes(group=State, col=Pop_est), shape=16, size=2, alpha=0.6) + 
    coord_map() + 
    theme_map() + 
    scale_color_viridis()

# Create a dataset of populations by state
st_data <- "california ; texas ; florida ; new york ; illinois ; pennsylvania ; ohio ; georgia ; north carolina ; michigan ; new jersey ; virginia ; washington ; arizona ; massachusetts ; indiana ; tennessee ; missouri ; maryland ; wisconsin ; minnesota ; colorado ; south carolina ; alabama ; louisiana ; kentucky ; oregon ; oklahoma ; connecticut ; puerto rico ; iowa ; utah ; mississippi ; arkansas ; kansas ; nevada ; new mexico ; nebraska ; west virginia ; idaho ; hawaii ; new hampshire ; maine ; rhode island ; montana ; delaware ; south dakota ; north dakota ; alaska ; district of columbia ; vermont ; wyoming"

pop_data <- "39144818 ; 27469114 ; 20271272 ; 19795791 ; 12859995 ; 12802503 ; 11613423 ; 10214860 ; 10042802 ; 9922576 ; 8958013 ; 8382993 ; 7170351 ; 6828065 ; 6794422 ; 6619680 ; 6600299 ; 6083672 ; 6006401 ; 5771337 ; 5489594 ; 5456574 ; 4896146 ; 4858979 ; 4670724 ; 4425092 ; 4028977 ; 3911338 ; 3590886 ; 3474182 ; 3123899 ; 2995919 ; 2992333 ; 2978204 ; 2911641 ; 2890845 ; 2085109 ; 1896190 ; 1844128 ; 1654930 ; 1431603 ; 1330608 ; 1329328 ; 1056298 ; 1032949 ; 945934 ; 858469 ; 756927 ; 738432 ; 672228 ; 626042 ; 586107"

pop <- data.frame(region=strsplit(st_data, split=" ; ")[[1]], 
                  Pop_est=as.numeric(strsplit(pop_data, split=" ; ")[[1]]), 
                  stringsAsFactors = FALSE
                  )


# Map the basic state data
state <- map_data("state")
ggplot(data=state, aes(x=long, y=lat, group=group, fill=region)) + 
    geom_polygon(col="white") + 
    coord_map()

# Map the states by population
state2 <- merge(state, pop, by="region")
ggplot(data=state2, aes(x=long, y=lat, group=group, fill=Pop_est)) + 
    geom_polygon(col="white") + 
    coord_map() + 
    theme_map()

# Import shape information: germany (commented out since files not available)
library(rgdal)
## Warning: package 'rgdal' was built under R version 3.2.5
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.2.5
## rgdal: version: 1.2-3, (SVN revision 639)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.0.1, released 2015/09/15
##  Path to GDAL shared files: C:/Users/Dave/Documents/R/win-library/3.2/rgdal/gdal
##  GDAL does not use iconv for recoding strings.
##  Loaded PROJ.4 runtime: Rel. 4.9.1, 04 March 2015, [PJ_VERSION: 491]
##  Path to PROJ.4 shared files: C:/Users/Dave/Documents/R/win-library/3.2/rgdal/proj
##  Linking to sp version: 1.2-3
# germany <- readOGR(dsn="shapes", "DEU_adm1")
# bundes <- fortify(germany)
# 
# ggplot(data=bundes, aes(x=long, y=lat, group=group)) + 
#     geom_polygon(fill="blue", col="white") +
#     coord_map() + 
#     theme_nothing()
# 
# bundes$state <- factor(as.numeric(bundes$id))
# levels(bundes$state) <- germany$NAME_1
# 
# bundes_unemp <- merge(bundes, unemp, by="state")
# 
# ggplot(bundes_unemp, aes(x = long, y = lat, group = group, fill=unemployment)) + 
#     geom_polygon() + 
#     coord_map() + 
#     theme_map()


# Create the map of London
library(ggmap)
london_map_13 <- get_map("London, England", zoom=13)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=London,+England&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=London,%20England&sensor=false
ggmap(london_map_13)

# Experiment with get_map() and use ggmap() to plot it!
# temp1 <- get_map("London, England", zoom=13, maptype="toner", source="stamen")
# ggmap(temp1)

temp2 <- get_map("London, England", zoom=13, maptype="hybrid")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=London,+England&zoom=13&size=640x640&scale=2&maptype=hybrid&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=London,%20England&sensor=false
ggmap(temp2)

# Map some key sites in London
london_sites <- strsplit("Tower of London, London ; Buckingham Palace, London ; Tower Bridge, London ; Queen Elizabeth Olympic Park, London", " ; ")[[1]]

xx <- geocode(london_sites)
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tower%20of%20London,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Buckingham%20Palace,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tower%20Bridge,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Queen%20Elizabeth%20Olympic%20Park,%20London&sensor=false
xx$location <- sub(", London", "", london_sites)
london_ton_13 <- get_map(location = "London, England", zoom = 13,
                         source = "stamen", maptype = "toner")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=London,+England&zoom=13&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=London,%20England&sensor=false
## Map from URL : http://tile.stamen.com/toner/13/4091/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4092/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4093/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4094/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4091/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4092/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4093/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4094/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4091/2724.png
## Map from URL : http://tile.stamen.com/toner/13/4092/2724.png
## Map from URL : http://tile.stamen.com/toner/13/4093/2724.png
## Map from URL : http://tile.stamen.com/toner/13/4094/2724.png
## Map from URL : http://tile.stamen.com/toner/13/4091/2725.png
## Map from URL : http://tile.stamen.com/toner/13/4092/2725.png
## Map from URL : http://tile.stamen.com/toner/13/4093/2725.png
## Map from URL : http://tile.stamen.com/toner/13/4094/2725.png
# Add a geom_points layer
ggmap(london_ton_13) + 
    geom_point(data=xx, aes(col=location), size=6)
## Warning: Removed 1 rows containing missing values (geom_point).

# Expand to use the bounding box
xx <- geocode(london_sites)
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tower%20of%20London,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Buckingham%20Palace,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tower%20Bridge,%20London&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Queen%20Elizabeth%20Olympic%20Park,%20London&sensor=false
xx$location <- sub(", London", "", london_sites)
xx$location[4] <- "Queen Elizabeth\nOlympic Park"

# Create bounding box: bbox
bbox <- make_bbox(lon = xx$lon, lat = xx$lat, f = 0.3)
london_ton_13 <- get_map(bbox, zoom = 13,
                         source = "stamen", maptype = "toner"
                         )
## Map from URL : http://tile.stamen.com/toner/13/4091/2722.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'a991dd39e80eba942f916d1a39eacba1.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4092/2722.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'ca4d532407f75fb40f356d82e9f3e868.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4093/2722.png
## Warning in file.remove(index[[url]]): cannot remove file
## '2d23913e51259d39586aa2f72cf7262a.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4094/2722.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'd95217086725a265fa36c032fbc90ad6.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4095/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4096/2722.png
## Map from URL : http://tile.stamen.com/toner/13/4091/2723.png
## Warning in file.remove(index[[url]]): cannot remove file
## '34870d377d252bc5a4f1705aaec727a7.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4092/2723.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'afecef47c6d1c2bb613ae817f73ef94e.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4093/2723.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'f0de52ccfe7b807ab9e08f76c37ff7be.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4094/2723.png
## Warning in file.remove(index[[url]]): cannot remove file
## 'f7fedbdcd7d7ca7fa3714dc4a4b6b802.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4095/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4096/2723.png
## Map from URL : http://tile.stamen.com/toner/13/4091/2724.png
## Warning in file.remove(index[[url]]): cannot remove file
## '8bebd0980b74870f3d7dc958d655198f.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4092/2724.png
## Warning in file.remove(index[[url]]): cannot remove file
## '245dcf359e94685490a556539f0ef26f.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4093/2724.png
## Warning in file.remove(index[[url]]): cannot remove file
## '92a8bede90fdcd1ba21c8c78c10f06b3.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4094/2724.png
## Warning in file.remove(index[[url]]): cannot remove file
## '9acc9d919db20a0d68fdef3dd4092cfe.rds', reason 'No such file or directory'
## Map from URL : http://tile.stamen.com/toner/13/4095/2724.png
## Map from URL : http://tile.stamen.com/toner/13/4096/2724.png
# Map from previous exercise
ggmap(london_ton_13) + 
    geom_point(data = xx, aes(col = location), size = 6)

# New map with labels
ggmap(london_ton_13) + 
    geom_label(data=xx, aes(label=location), size=4, fontface="bold", fill="grey90", col="#E41A1C")

# Get the map data of "Germany" and Plot map and polygon on top: (not displayed)
# germany_06 <- get_map("Germany", zoom=6)
# ggmap(germany_06) + 
#     geom_polygon(data=bundes, aes(x=long, y=lat, group=group), fill=NA, col="red") + 
#     coord_map()


# Animated Japan map (not shown)
# str(japan)
# 
# saveGIF({
# 
#   for (i in unique(japan$time)) {
# 
#     data <- japan[japan$time == i, ]
# 
#     p <- ggplot(data, aes(x = AGE, y = POP, fill = SEX, width = 1)) +
#       coord_flip() +
#       geom_bar(data = data[data$SEX == "Female",], stat = "identity") +
#       geom_bar(data = data[data$SEX == "Male",], stat = "identity") +
#       ggtitle(i)
# 
#     print(p)
# 
#   }
# 
# }, movie.name = "pyramid.gif", interval = 0.1)


# Animate the vocabularies by year
# library(gganimate)
# library(car)
# data(Vocab)
# p <- ggplot(Vocab, aes(x = education, y = vocabulary,
#                        color = year, group = year, 
#                        frame=year, cumulative=TRUE
#                        )
#             ) +
#     stat_smooth(method = "lm", se = FALSE, size = 3)
# 
# gg_animate(p, filename="vocab.gif", interval = 1.0)

The ggplot2 internals include 2 plotting systems in R (base and grid):

  • base treats the graphics as static (cannot be changed after drawing - need to re-do from scratch)
  • grid provides low-level graphics capabilities, meaning that it can be edited (create graphic outputs, and layer them)
    • Some of the grid capabilities include grid.rect(), grid.lines(), grid.text() and the like

There are many possible customizations to the grid process:

  • gp=gpar() argument inside of each grid.xxx() which allows more control over things like fill
  • name=“” argument which allows re-use of a specific grid-graphic system
  • There are “viewports”, basically a fancy name for graphics displays
    • vp <- viewport() will create a new viewport that is not yet active
    • pushViewport(vp) will push this viewport on to the active device
    • The “viewports” can also be named for re-use
    • There is also a dataViewport() call which associates data to the viewport – vp_data <- dataViewport(mtcars\(wt, mtcars\)mpg)
  • grid.edit() is the magic – can supply a named object and change it through this

The ggplot2 package is built using the grid() framework:

  • There are both graphic outputs (as per above) and graphics objects (“grobs”)
  • Underlying evey ggplot2 function, there will be a collection of grobs
  • The function ggplotGrob() will print the grobs associated to “obj”
  • Since these are at heart lists, can run g$grob[[8]], for example, to extract the eighth item from the grob list
  • Can create new text using textGrob(label=“”, gp=gpar(fontsize=7, col=“gray25”))
    • Can then add this to an existing grob with gtable_add_grob()

Example code includes:

library(grid)
library(gtable)
## Warning: package 'gtable' was built under R version 3.2.5
# Draw rectangle in null viewport
grid.rect(gp=gpar(fill = "grey90"))

# Write text in null viewport
grid.text("null viewport")

# Draw a line
grid.lines(x=c(0, 0.75), y=c(0.25, 1), gp=gpar(lty=2, col="red"))


# Populate null viewport
grid.rect(gp = gpar(fill = "grey90"))
grid.text("null viewport")
grid.lines(x = c(0,0.75), y = c(0.25, 1),
           gp = gpar(lty = 2, col = "red"))

# Create new viewport: vp
vp <- viewport(x=0.5, y=0.5, width=0.5, height=0.5, just="center")

# Push vp
pushViewport(vp)

# Populate new viewport with rectangle
grid.rect(gp=gpar(fill="blue"))


# Create plot viewport: pvp
mar <- c(5, 4, 2, 2)
pvp <- plotViewport(mar)

# Push pvp
pushViewport(pvp)

# Add rectangle
grid.rect(gp=gpar(fill="grey80"))

# Create data viewport: dvp
dvp <- dataViewport(mtcars$wt, mtcars$mpg)

# Push dvp
pushViewport(dvp)

# Add two axes
grid.xaxis()
grid.yaxis()


# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()

# Add text to x axis
grid.text(label="Weight", y=unit(-3, "lines"))

# Add text to y axis
grid.text(label="MPG", x=unit(-3, "lines"), rot=90)

# Add points
grid.points(x=mtcars$wt, y=mtcars$mpg, pch=16)


# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()

# Work from before - add names
grid.text("Weight", y = unit(-3, "lines"), name = "xaxis")
grid.text("MPG", x = unit(-3, "lines"), rot = 90, name = "yaxis")
grid.points(x = mtcars$wt, y = mtcars$mpg, pch = 16, name = "datapoints")

# Edit "xaxis"
grid.edit("xaxis", label="Miles/(US) gallon")

# Edit "yaxis"
grid.edit("yaxis", label="Weight (1000 lbs)")

# Edit "datapoints"
grid.edit("datapoints", gp=gpar(col="#C3212766", cex=2))


# A simple plot p
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()

# Create gtab with ggplotGrob()
gtab <- ggplotGrob(p)

# Print out gtab
gtab
## TableGrob (10 x 9) "layout": 18 grobs
##     z         cells       name                                     grob
## 1   0 ( 1-10, 1- 9) background        rect[plot.background..rect.15947]
## 2   5 ( 5- 5, 3- 3)     spacer                           zeroGrob[NULL]
## 3   7 ( 6- 6, 3- 3)     axis-l    absoluteGrob[GRID.absoluteGrob.15923]
## 4   3 ( 7- 7, 3- 3)     spacer                           zeroGrob[NULL]
## 5   6 ( 5- 5, 4- 4)     axis-t                           zeroGrob[NULL]
## 6   1 ( 6- 6, 4- 4)      panel               gTree[panel-1.gTree.15903]
## 7   9 ( 7- 7, 4- 4)     axis-b    absoluteGrob[GRID.absoluteGrob.15916]
## 8   4 ( 5- 5, 5- 5)     spacer                           zeroGrob[NULL]
## 9   8 ( 6- 6, 5- 5)     axis-r                           zeroGrob[NULL]
## 10  2 ( 7- 7, 5- 5)     spacer                           zeroGrob[NULL]
## 11 10 ( 4- 4, 4- 4)     xlab-t                           zeroGrob[NULL]
## 12 11 ( 8- 8, 4- 4)     xlab-b titleGrob[axis.title.x..titleGrob.15906]
## 13 12 ( 6- 6, 2- 2)     ylab-l titleGrob[axis.title.y..titleGrob.15909]
## 14 13 ( 6- 6, 6- 6)     ylab-r                           zeroGrob[NULL]
## 15 14 ( 6- 6, 8- 8)  guide-box                        gtable[guide-box]
## 16 15 ( 3- 3, 4- 4)   subtitle  zeroGrob[plot.subtitle..zeroGrob.15944]
## 17 16 ( 2- 2, 4- 4)      title     zeroGrob[plot.title..zeroGrob.15943]
## 18 17 ( 9- 9, 4- 4)    caption   zeroGrob[plot.caption..zeroGrob.15945]
# Extract the grobs from gtab: gtab
g <- gtab$grobs

# Draw only the legend
grid.draw(g[[15]])


# Code from before
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
gtab <- ggplotGrob(p)
g <- gtab$grobs
grid.draw(g[[15]])

# Show layout of g[[15]]
gtable_show_layout(g[[15]])

# Create text grob
my_text <- textGrob(label = "Motor Trend, 1974", gp = gpar(fontsize = 7, col = "gray25"))

# Use gtable_add_grob to modify original gtab
new_legend <- gtable_add_grob(gtab$grobs[[15]], my_text, 3, 2)

# Update in gtab
gtab$grobs[[15]] <- new_legend

# Draw gtab
grid.draw(gtab)

Data Visualization (ggvis)

The ggvis is based on a “grammar of graphics” and is closely linked to ggplot2 (both designed by Wickham). The objective for ggvis is to combine the analytic power of R with the visual power of Javascript.

Broadly, the “grammar of graphics” includes several layers such as Data, Coordinate System, Marks, Properties, and the like. This is similar to ggplot2 though with a modified syntax that is more in line with dplyr chaining:

myData %>% ggvis(~myX, ~myY, fill = ~myFill, …) %>% layer_myMarkChoice()

Note that the := operator is the static assignment operator, ensuring that a call to “red” means the color “red” and not merely a character vector coerced to the required length with every entry being “red”. The ~ symbolizes that this is a variable in my dataset. So ~red would mean the variable red in the dataset undergoing plotting.

Some basic example code includes:

library(ggvis)
## Warning: package 'ggvis' was built under R version 3.2.5
## 
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
## 
##     resolution
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Change the code below to make a graph with red points
mtcars %>% ggvis(~wt, ~mpg, fill := "red") %>% layer_points()

# Change the code below draw smooths instead of points
mtcars %>% ggvis(~wt, ~mpg) %>% layer_smooths()

# Change the code below to make a graph containing both points and a smoothed summary line
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>% layer_smooths()

data(pressure)
str(pressure)
## 'data.frame':    19 obs. of  2 variables:
##  $ temperature: num  0 20 40 60 80 100 120 140 160 180 ...
##  $ pressure   : num  0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
# Adapt the code: show bars instead of points
pressure %>% ggvis(~temperature, ~pressure) %>% layer_bars()

# Adapt the codee: show lines instead of points
pressure %>% ggvis(~temperature, ~pressure) %>% layer_lines()

# Extend the code: map the fill property to the temperature variable
pressure %>% ggvis(~temperature, ~pressure, fill=~temperature) %>% layer_points()

# Extend the code: map the size property to the pressure variable
pressure %>% ggvis(~temperature, ~pressure, size=~pressure) %>% layer_points()

There are three main new operators in ggvis (relative to ggplot):

  • %>% (piping from magrittr)
  • ~ (for defining variables as being from the dataset)
  • := (for setting of properties)
    • Can think of having a data space (e.g., species or size) and a visual space (e.g., species is filled by color or size is represented by shape)
    • The straight-up equal sign (=) and the tilde tell ggvis that the variables exists in the data space
    • Using the setting operator (:=) you can make an assignment to something that is not in the data space; for example fill := “red” or size := 100 (purely a visual space assignment)

The line is a special type of mark (second most common after points) - stroke, strokeWidth, strokeOpacity, strokeDash, fill, fillOpacity

Other forms include:

  • Paths, sort of like lines, but connecting points in their same order as the dataset
  • Ribbons, which have bounding
  • Smooths, which have loess (or lm or etc.) added to the data
  • Model predictions, which add a general model prediction

Example code includes:

data(faithful)
str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
faithful %>% ggvis(~waiting, ~eruptions) %>% layer_points()

faithful %>% 
    ggvis(~waiting, ~eruptions, size = ~eruptions) %>% 
    layer_points(opacity := 0.5, fill := "blue", stroke := "black")

faithful %>% 
    ggvis(~waiting, ~eruptions, fillOpacity = ~eruptions) %>% 
    layer_points(size := 100, fill := "red", stroke := "red", shape := "cross")

data(pressure)
str(pressure)
## 'data.frame':    19 obs. of  2 variables:
##  $ temperature: num  0 20 40 60 80 100 120 140 160 180 ...
##  $ pressure   : num  0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
# Modify this graph to map the size property to the pressure variable
pressure %>% ggvis(~temperature, ~pressure, size = ~pressure) %>% layer_points()

# Modify this graph by setting the size property
pressure %>% ggvis(~temperature, ~pressure, size := 100) %>% layer_points()

# Fix this code to set the fill property to red
pressure %>% ggvis(~temperature, ~pressure, fill := "red") %>% layer_points()

pressure %>% 
    ggvis(~temperature, ~pressure) %>% 
    layer_lines(stroke := "red", strokeWidth := 2, strokeDash := 6)

# texas %>% ggvis(~long, ~lat) %>% layer_paths(fill := "darkorange")


data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars %>% compute_smooth(mpg ~ wt)
##       pred_    resp_
## 1  1.513000 32.08897
## 2  1.562506 31.68786
## 3  1.612013 31.28163
## 4  1.661519 30.87037
## 5  1.711025 30.45419
## 6  1.760532 30.03318
## 7  1.810038 29.60745
## 8  1.859544 29.17711
## 9  1.909051 28.74224
## 10 1.958557 28.30017
## 11 2.008063 27.83462
## 12 2.057570 27.34766
## 13 2.107076 26.84498
## 14 2.156582 26.33229
## 15 2.206089 25.81529
## 16 2.255595 25.29968
## 17 2.305101 24.79115
## 18 2.354608 24.29542
## 19 2.404114 23.81818
## 20 2.453620 23.36514
## 21 2.503127 22.95525
## 22 2.552633 22.61385
## 23 2.602139 22.32759
## 24 2.651646 22.08176
## 25 2.701152 21.86167
## 26 2.750658 21.65260
## 27 2.800165 21.43987
## 28 2.849671 21.20875
## 29 2.899177 20.95334
## 30 2.948684 20.71584
## 31 2.998190 20.49571
## 32 3.047696 20.28293
## 33 3.097203 20.06753
## 34 3.146709 19.83950
## 35 3.196215 19.58885
## 36 3.245722 19.29716
## 37 3.295228 18.94441
## 38 3.344734 18.56700
## 39 3.394241 18.20570
## 40 3.443747 17.90090
## 41 3.493253 17.62060
## 42 3.542759 17.34002
## 43 3.592266 17.07908
## 44 3.641772 16.81759
## 45 3.691278 16.55757
## 46 3.740785 16.30833
## 47 3.790291 16.07916
## 48 3.839797 15.87937
## 49 3.889304 15.70181
## 50 3.938810 15.52594
## 51 3.988316 15.35173
## 52 4.037823 15.17933
## 53 4.087329 15.00894
## 54 4.136835 14.84072
## 55 4.186342 14.67484
## 56 4.235848 14.51148
## 57 4.285354 14.35082
## 58 4.334861 14.19302
## 59 4.384367 14.03826
## 60 4.433873 13.88672
## 61 4.483380 13.73856
## 62 4.532886 13.59396
## 63 4.582392 13.45310
## 64 4.631899 13.31614
## 65 4.681405 13.18326
## 66 4.730911 13.05464
## 67 4.780418 12.93045
## 68 4.829924 12.81086
## 69 4.879430 12.69604
## 70 4.928937 12.58617
## 71 4.978443 12.48143
## 72 5.027949 12.38198
## 73 5.077456 12.28799
## 74 5.126962 12.19966
## 75 5.176468 12.11713
## 76 5.225975 12.04060
## 77 5.275481 11.97023
## 78 5.324987 11.90620
## 79 5.374494 11.84868
## 80 5.424000 11.79784
# Extend with ggvis() and layer_lines()
mtcars %>% compute_smooth(mpg ~ wt) %>% ggvis(~pred_, ~resp_) %>% layer_lines()

# Extend with layer_points() and layer_smooths()
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>% layer_smooths()

Behind the scenes, ggvis uses several compute functions to help with visualizations:

  • Using a computation such as a loess (smooth) means that you then map the marks to the smooth data and not the original data
  • The ggvis library only has five key marks - points, paths, ribbons, rects, and text
  • These are extended by a library of compute functions - smooths, model_predictions, bars, histograms, densities, etc.
  • Typically, the compute functions are called automatically by a layer called by the user

The ggvis library is especially well designed to interact with the dplyr library (Hadley Wickham):

  • For example, layer_lines() is just arrange() %>% layer_paths(), since the arrange puts it in the proper x-axis order
  • The dplyr::group_by() offers the opportunities to treat different subsets of the data differently

Example code includes:

data(faithful)
str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
faithful %>% ggvis(~waiting) %>% layer_histograms(width = 5)

# Finish the command
faithful %>%
  compute_bin(~waiting, width = 5) %>%
  ggvis(x = ~xmin_, x2 = ~xmax_, y = 0, y2 = ~count_) %>%
  layer_rects()

# Build the density plot
faithful %>% ggvis(~waiting, fill := "green") %>% layer_densities()

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars %>% 
  ggvis(x = ~factor(cyl)) %>% 
  layer_bars()

# Instruction 1
mtcars %>% 
    group_by(cyl) %>% 
    ggvis(~mpg, ~wt, stroke = ~factor(cyl)) %>% 
    layer_smooths()
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

# Instruction 2
mtcars %>% 
    group_by(cyl) %>% 
    ggvis(~mpg, fill = ~factor(cyl)) %>% 
    layer_densities()
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

mtcars %>% 
    group_by(cyl, am) %>% 
    ggvis(~mpg, fill = ~interaction(cyl, am)) %>% 
    layer_densities()
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

Can add interactivity to plots in ggvis:

  • For example, a slider to change a key parameter such as “span”
  • Can actually add any portion of ggvis to an interactive user control
  • While the interactive ggvis is active, R will remain busy waiting to re-plot on new user inputs
    • Need to hit escape and/or the red stop sign to return control to R
  • Based on the Shiny framework
    • Can send to other R users
    • Can use Shiny server

Multi-layered ggvis plots:

  • With ggvis, this is as simple as adding the second layer after the first layer
  • Keep adding layers with the piping operator

Example code includes:

data(faithful)
str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Adapt the code: set fill with a select box
faithful %>% 
  ggvis(~waiting, ~eruptions, fillOpacity := 0.5, 
        shape := input_select(label = "Choose shape:", 
                              choices = c("circle", "square", "cross", 
                                          "diamond", "triangle-up", "triangle-down"
                                          )
                              ), 
        fill := input_select(label = "Choose color:", 
                             choices = c("black", "red", "blue", "green")
                             )
        ) %>% 
  layer_points()
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

# Add radio buttons to control the fill of the plot
mtcars %>% 
  ggvis(~mpg, ~wt,
        fill := input_radiobuttons(label = "Choose color:", 
                                   choices = c("black", "red", "blue", "green")
                                   )
        ) %>% 
  layer_points()
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

mtcars %>% 
  ggvis(~mpg, ~wt, 
        fill := input_text(label = "Choose color:", value = "black")) %>% 
  layer_points()
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

# Map the fill property to a select box that returns variable names
mtcars %>% 
  ggvis(~mpg, ~wt, fill = input_select(label = "Choose fill variable:", 
                                       choices = names(mtcars), map=as.name
                                       )
        ) %>% 
  layer_points()
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

# Map the bindwidth to a numeric field ("Choose a binwidth:")
mtcars %>% 
  ggvis(~mpg) %>% 
  layer_histograms(width = input_numeric(label = "Choose a binwidth:", value = 1))
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

# Map the binwidth to a slider bar ("Choose a binwidth:") with the correct specifications
mtcars %>% 
  ggvis(~mpg) %>% 
  layer_histograms(width = input_slider(label = "Choose a binwidth:", 1, 20))
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

# Add a layer of points to the graph below.
pressure %>% 
  ggvis(~temperature, ~pressure, stroke := "skyblue") %>% 
  layer_lines() %>%
  layer_points()

# Copy and adapt so that only the lines layer uses a skyblue stroke.
pressure %>% 
  ggvis(~temperature, ~pressure) %>% 
  layer_lines(stroke := "skyblue") %>%
  layer_points()

# Rewrite the code below so that only the points layer uses the shape property.
pressure %>% 
  ggvis(~temperature, ~pressure) %>% 
  layer_lines(stroke := "skyblue") %>% 
  layer_points(shape := "triangle-up")

# Refactor the code for the graph below to make it as concise as possible
pressure %>% 
  ggvis(~temperature, ~pressure, stroke := "skyblue", strokeOpacity := 0.5, strokeWidth := 5) %>% 
  layer_lines() %>% 
  layer_points(fill = ~temperature, 
              shape := "triangle-up", 
              size := 300)

# Add more layers to the line plot
pressure %>%
  ggvis(~temperature, ~pressure) %>%
  layer_lines(opacity := 0.5) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", stroke := "navy") %>%
  layer_smooths(stroke := "skyblue")
## Guessing formula = pressure ~ temperature

The add_axis() function can be used to change the titles and axis labels:

  • Can also change the ticks and the location (e.g., top/bottom) of where the axes appear on the plot
  • add_axis(“x”, title = “axis title”, values = c(1, 2, 3), subdivide = 5, orient = “top”)
  • The first argument specifies which axis to customize
    • title - the title of the axis you specified in the first argument
    • values - determine where labelled tick marks will appear on each axis
    • subdivide - insert unlabelled tick marks between the labelled tick marks on an axis
    • orient - control where the axis appears. For the x axis, you can use “top” or “bottom”, for the y axis, you can use “left” or “right”

The add_legends() function can help with cleaning up legends (make them look tidier). This is similar to the arguments passed to the “adding an axis” above.

Can also customize the scales (relationships between data spaces and visual spaces) for the data:

  • Mappings go between the data space (e.g., species) and the visual space (e.g., fill color)
  • The default mapping is handled in ggvis by a function called a scale
  • The scale_nominal() command can change the mapping
    • scale_nominal(“fill”, range = c(“yellow”, “orange”, “red”))

Example code includes:

data(faithful)
str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
# Defaulted axis
faithful %>% 
  ggvis(~waiting, ~eruptions) %>% 
  layer_points()

# Customized axis
faithful %>% 
  ggvis(~waiting, ~eruptions) %>% 
  layer_points() %>%
  add_axis("x", title="Time since previous eruption (m)", 
           values=c(50, 60, 70, 80, 90), subdivide=9, orient="top"
           ) %>%
  add_axis("y", title="Duration of eruption (m)", values=c(2, 3, 4, 5), 
           subdivide=9, orient="right"
           )

data(pressure)
str(pressure)
## 'data.frame':    19 obs. of  2 variables:
##  $ temperature: num  0 20 40 60 80 100 120 140 160 180 ...
##  $ pressure   : num  0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
# Add a legend
faithful %>% 
  ggvis(~waiting, ~eruptions, opacity := 0.6, 
        fill = ~factor(round(eruptions))) %>% 
  layer_points() %>%
  add_legend("fill", title="~ duration (m)", orient="left")

# Original code with jumbled legends
faithful %>% 
  ggvis(~waiting, ~eruptions, opacity := 0.6, 
        fill = ~factor(round(eruptions)), shape = ~factor(round(eruptions)), 
        size = ~round(eruptions))  %>%
  layer_points()

# Fix the legend
faithful %>% 
  ggvis(~waiting, ~eruptions, opacity := 0.6, 
        fill = ~factor(round(eruptions)), shape = ~factor(round(eruptions)), 
        size = ~round(eruptions))  %>%
  layer_points() %>%
  add_legend(c("fill", "shape", "size"), title="~ duration (m)")

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Add a scale_numeric()
mtcars %>% 
  ggvis(~wt, ~mpg, fill = ~disp, stroke = ~disp, strokeWidth := 2) %>%
  layer_points() %>%
  scale_numeric("fill", range = c("red", "yellow")) %>%
  scale_numeric("stroke", range = c("darkred", "orange"))

# Add a scale_numeric()
mtcars %>% ggvis(~wt, ~mpg, fill = ~hp) %>%
  layer_points() %>%
  scale_numeric("fill", range=c("green", "beige"))

# Add a scale_nominal()
mtcars %>% ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  scale_nominal("fill", range=c("purple", "blue", "green"))

# Original plot becomes too transparent
mtcars %>% ggvis(x = ~wt, y = ~mpg, fill = ~factor(cyl), opacity = ~hp) %>%
  layer_points()

# Range to prevent overly transparent data points
mtcars %>% ggvis(x = ~wt, y = ~mpg, fill = ~factor(cyl), opacity = ~hp) %>%
  layer_points() %>%
  scale_numeric("opacity", range=c(0.2, 1))

mtcars %>% ggvis(~wt, ~mpg, fill = ~disp) %>%
  layer_points() %>%
  scale_numeric("y", domain = c(0, NA)) %>%  # NA means top-of-data-range
  scale_numeric("x", domain = c(0, 6))

mtcars$color <- c('red' , 'teal' , '#cccccc' , 'tan' , 'red' , 'teal' , '#cccccc' , 'tan' , 
                  'red' , 'teal' , '#cccccc' , 'tan' , 'red' , 'teal' , '#cccccc' , 'tan' , 
                  'red' , 'teal' , '#cccccc' , 'tan' , 'red' , 'teal' , '#cccccc' , 'tan' , 
                  'red' , 'teal' , '#cccccc' , 'tan' , 'red' , 'teal' , '#cccccc' , 'tan'
                  )

# Using fill by mapping the "color" variable to the ggvis scales
mtcars %>% 
  ggvis(x = ~wt, y = ~mpg, fill = ~color) %>% 
  layer_points()

# Using fill based directly on the values in the "color" variable
mtcars %>% 
  ggvis(x = ~wt, y = ~mpg, fill := ~color) %>% 
  layer_points()